Preface


Welcome! 


A few quick words of
introduction to the book, how
to get the notebooks and figures,
and thanks to the people who helped me.




What You'll Get from This Book 


Hello! 


If you're interested in deep learning (DL) and machine learning (ML), 
then there’s good stuff for you in this book. 


My goal in this book is to give you the broad skills to be an effective 
practitioner of machine learning and deep learning. 


When you've read this book, you will be able to: 

• Design and train your own deep networks.
• Use your networks to understand your data, or make new data.
• Assign descriptive categories to text, images, and other types of data.
• Predict the next value for a sequence of data.
• Investigate the structure of your data.
• Process your data for maximum efficiency.
• Use any programming language and DL library you like.
• Understand new papers and ideas, and put them into practice.
• Enjoy talking about deep learning with other people.


We'll take a serious but friendly approach, supported by tons of illus- 
trations. And we'll do it all without any code, and without any math 
beyond multiplication. 


If that sounds good to you, welcome aboard! 




Who This Book Is For 


This book is designed for people who want to use machine learning 
and deep learning in their own work. This includes programmers, art- 
ists, engineers, scientists, executives, musicians, doctors, and anyone 
else who wants to work with large amounts of information to extract 
meaning from it, or generate new data. 


Many of the tools of machine learning, and deep learning in particular, 
are embodied in multiple free, open-source libraries that anyone can 
immediately download and use. 


Even though these tools are free and easy to install, they still require 
significant technical knowledge to use them properly. It’s easy to ask 
the computer to do something nonsensical, and it will happily do it, 
giving us back more nonsense as output. 


This kind of thing happens all the time. Though machine learning 
and deep learning libraries are powerful, they’re not yet user friendly. 
Choosing the right algorithms, and then applying them properly, still 
requires a stream of technically informed decisions. When things don’t
go as planned, which happens often, we need to use our knowledge of
what’s going on inside the system in order to fix it.


There are multiple approaches to learning and mastering this essential 
information, depending on how you like to learn. 


Some people like hardcore, detailed algorithm analysis, supported 
by extensive mathematics. If that’s how you like to learn, there are 
great books out there that offer this style of presentation [Bishop06]
[Goodfellow17]. This approach requires intensive effort, but pays off 
with a thorough understanding of how and why the machinery works. 
If you start this way, then you have to put in another chunk of work to 
translate that theoretical knowledge into contemporary practice. 


At the other extreme, some people just want to know how to do
some particular task. There are great books that take this cook- 
book approach for various machine-learning libraries [Chollet17] 
[Muller-Guido16] [Raschka15] [VanderPlas16]. This approach is easier 
than the mathematically intensive route, but you can feel like you’re 
missing the structural information that explains why things work as 
they do. Without that information, and its vocabulary, it can be hard 
to work out why something that you think ought to work doesn’t work, 
or why something doesn’t work as well as you thought it should. It can 
also be challenging to read the literature describing new ideas and 
results, because those discussions usually assume a shared body of 
underlying knowledge that an approach based on a single library or 
language doesn’t provide. 


This book takes a middle road. My purpose is practical: to give you the 
tools to practice deep learning with confidence. I want you to make 
wise choices as you do your work, and be able to follow the flood of 
exciting new ideas appearing almost every day. 


My goal here is to cover the fundamentals just deeply enough to give 
you a broad base of support. I want you to have enough background 
not just for the topics in this book, but also the materials you're likely 
to need to consult and read as you actually do deep learning work. 


This is not a book about programming. Programming is important, 
but it inevitably involves all kinds of details that are irrelevant to our 
larger subject. And programming examples lock us into one library, or 
one language. While such details are necessary for building final systems,
they can be distracting when we're trying to focus on the big
ideas. Rather than get waylaid by discussions of loops and indices and
data structures, we discuss everything here in a language- and
library-independent way. Once you have the ideas firmly in place, reading the
documentation for any library will be a straightforward affair. 


We do put our feet on the ground in Chapters 15, 23, and 24, when 
we discuss the scikit-learn library for machine learning, and the Keras 
library for deep learning. These libraries are both Python based. In 
those chapters we dive into the details of those Python libraries, and
include plenty of example code. Even if you’re not into Python, these 
programs will give you a sense for typical workflows and program 
structures, which can help show how to attack a new problem. 


The code in those programming chapters is available in Python note- 
books. These are for use with the browser-based Jupyter programming 
environment [Jupyter16]. Alternatively, you can use a more classical 
Python development environment, such as PyCharm [JetBrains17]. 


Most of the other chapters also have supporting, optional Python note- 
books. These give the code for every computer-generated figure in the 
book, often using the techniques discussed in that chapter. Because 
we're not really focusing on Python and programming (except for the
chapters mentioned above), these notebooks are meant as a “behind 
the scenes” look, and are only lightly commented. 


Machine learning, deep learning, and big data are having an unexpect- 
edly rapid and profound influence on societies around the world. What 
this means for people and cultures is a complicated and important 
subject. Some interesting books and articles tackle the topic head-on, 
often coming to subtle mixtures of positive and negative conclusions 
[Aguera y Arcas 17] [Barrat15] [Domingos15] [Kaplan16]. 


Almost No Math


Lots of smart people are not fans of complicated equations. If that’s 
you, then you're right at home!


There’s just about no math in this book. If you’re comfortable with 
multiplication, you’re set, because that’s as mathematical as we get. 


Many of the algorithms we'll discuss are based on rich sources of theory 
and are the result of careful analysis and development. It’s important 
to know that stuff if you’re modifying the algorithm for some new pur- 
pose, or writing your own implementation. But in practice, just about 
everyone uses highly optimized implementations written by experts, 
available in free and open-source libraries. 


Our goals are to understand the principles of these techniques, how 
to apply them properly, and how to interpret the results. None of that 
requires us to get into the mathematical structure that’s under the 
hood. 


If you love math, or you want to see the theory, follow the references in
each chapter. Much of this material is elegant and intellectually stimulating,
and provides details that I have deliberately omitted from this
book. But if math isn’t your thing, there’s no need to get into it.


Lots of Figures 


Some ideas are more clearly communicated with pictures than with 
words. And even when words do the job, a picture can help cement the 
ideas. So this book is profusely illustrated with original figures. 


All of the figures in this book are available for free download (see 
below). 


Downloads


You can download the Jupyter/Python notebooks for this book, all the 
figures, and other files related to this book, all for free. 


All the Notebooks 


All of the Jupyter/Python notebooks for this book are available on 
GitHub. 


The notebooks for Chapter 15 (scikit-learn) and Chapters 23 and 24 
(Keras) contain all the code that’s presented in those chapters. 


The other notebooks are available as a kind of “behind the scenes” look 
at how the book’s figures were made. They’re lightly documented, and 
meant to serve more as references than tutorials. 


The notebooks are released under the MIT license, which basically 
means that you're free to use them for any purpose. There are no 
promises of any kind that the code is free of bugs, that it will run prop- 
erly, that it won’t crash, and so on. Feel free to grab the code and adapt 
it as you see fit, though as the license says, keep the copyright notice 
around (it’s in the file named simply LICENSE). 


https://github.com/blueberrymusic/DeepLearningBookCode-Volume1
https://github.com/blueberrymusic/DeepLearningBookCode-Volume2


All the Figures 


All of the figures in this book are available on GitHub as high-resolution
PNG files. You're free to use them in classes, talks, lectures,
reports, papers, even other books. 


Like the code, the figures are provided under the MIT license, so you 
can use them as you like as long as you keep the copyright notice 
around. You don’t have to credit me as their creator when you use 
these figures, but I’d appreciate it if you would. 


The filenames match the figure numbers in the book, so they’re easy to
find. When you're looking for something visually, it may be helpful to 
look at the thumbnail pages. These hold 20 images each: 


https://github.com/blueberrymusic/DeepLearningBookFigures-Thumbnails


The figures themselves are grouped into the two volumes: 


https://github.com/blueberrymusic/DeepLearningBookFigures-Volume1
https://github.com/blueberrymusic/DeepLearningBookFigures-Volume2


Resources 


The resources directory contains other files, such as a template for the 
deep learning icons we use later in the book. 


https://github.com/blueberrymusic/DeepLearningBook-Resources 


Errata 


Despite my best efforts, no book of this size is going to be free of 
errors. If you spot something that seems wrong, please let me know at 
andrew@dlbasics.com. I’ll keep a list of errata on the book’s website at
https://dlbasics.com. 


Two Volumes 


This ended up as a large book, so I’ve organized it into two volumes of 
roughly equal size. 


Because the book is cumulative, the second volume picks up where the 
first leaves off. If you’re reading the second volume now, you should 
have already read the first volume, or feel confident that you under- 
stand the material presented there. 


Thank You!


Authors like to say that nobody writes a book alone. We say that 
because it’s true. 


For their consistent and enthusiastic support of this project, and help- 
ing me feel good about it all the way through, I am enormously grateful 
to Eric Braun, Eric Haines, Steven Drucker, and Tom Reike. Thank 
you for your friendship and encouragement. 


Huge thanks are due to my reviewers, whose generous and insightful 
comments greatly improved this book: Adam Finkelstein, Alex Colburn, 
Alexander Keller, Alyn Rockwood, Angelo Pesce, Barbara Mones, Brian 
Wyvill, Craig Kaplan, Doug Roble, Eric Braun, Eric Haines, Greg Turk, 
Jeff Hultquist, Jessica Hodgins, Kristi Morton, Lesley Istead, Luis 
Avarado, Matt Pharr, Mike Tyka, Morgan McGuire, Paul Beardsley, 
Paul Strauss, Peter Shirley, Philipp Slusallek, Serban Porumbescu, 
Stefanus Du Toit, Steven Drucker, Wenhao Yu, and Zackory Erickson. 


Special thanks to super reviewers Alexander Keller, Eric Haines, 
Jessica Hodgins, and Luis Avarado, who read all or most of the manu- 
script and offered terrific feedback on both presentation and structure. 


Thanks to Morgan McGuire for Markdeep, which enabled me to focus 
on what I was saying, rather than the mechanics of how to format it. It 
made writing this book a remarkably smooth and fluid process. 


Thanks to Todd Szymanski for insightful advice on the design and lay- 
out of the book’s contents and covers, and for catching layout errors. 


Thanks to early readers who caught typos and other problems: 
Christian Forfang, David Pol, Eric Haines, Gopi Meenakshisundaram, 
Kostya Smolenskiy, Mauricio Vives, Mike Wong, and Mrinal Mohit. 


All of these people improved the book, but the final decisions were my 
own. Any problems that remain are my responsibility. 


Chapter 20


Deep Learning 


We'll see the basic structure of deep 
learning networks, and survey many 
of the types of layers they’re built from. 


20.1 Why This Chapter Is Here


In previous chapters we've laid a lot of groundwork for the algorithms 
that go into constructing neural networks. In this chapter we'll pull 
those threads together and look at building a network by assembling 
artificial neurons into layers. As we saw in Chapter 18, this lets us 
use the efficient backpropagation algorithm to improve the network’s 
performance. A network built from a series of layers is often called a 
deep network, and when such networks learn from data we provide, 
we call this deep learning. 


We'll discuss the terminology of deep learning, and look at many of 
the most popular layers that are used in deep-learning networks. We'll 
look at some example networks, and then consider how to build a new 
one and interpret its results. 


This chapter sets up the remainder of the book, where we look at dif- 
ferent specialized forms of deep learning for different tasks. 


20.2 Deep Learning Overview 


Neural networks built from a stack of layers are often called deep 
networks (they could have also been called “tall” or “wide”, or “long” 
networks, but “deep” is the direction that stuck). When we're using 
deep networks we generally say we’re doing deep learning. 


The phrase deep learning usually refers to neural networks that are 
arranged in stacks of layers. The more general phrase machine learn- 
ing usually refers to both deep learning, and other algorithms we’ve 
seen (like the classifiers in Chapter 13) that are not based on neural 
networks. But some authors treat “machine learning” and “deep learn- 
ing” as two different fields, so that “machine learning” refers only to 
those algorithms that don’t use neural networks. Thus there are books
with “machine learning” in the title that include neural networks, and
others that don’t. It’s always worth a moment to work out which way a 
particular author is using this phrase. 


A result of organizing neurons in layers is that a deep learning network 
is able to analyze data hierarchically. The early layers look at the 
raw data, and each subsequent layer is able to use information from 
neurons on the previous layer to process larger chunks of data. For 
example, when considering a photograph, the first layer usually looks 
at the individual pixels. The next layer would look at groups of pixels, 
the one after that at groups of those groups, and so on. Early layers 
might notice that some pixels are darker than others, while later layers 
might notice that a clump of pixels looks like an eye, and a much later 
layer might identify the stripes that reveal that the whole image shows 
a tiger. 


Figure 20.1 shows an example of a deep learning architecture using 3 
layers. 


Figure 20.1: A deep learning network. We have 4 inputs flowing through 
3 layers, creating 3 outputs at the end. We say that this network is 
“fully-connected” because every neuron in each layer receives input from 
every neuron on the previous layer. 


When we draw the layers vertically, as in Figure 20.1, the inputs are 
almost always drawn at the bottom, and the outputs where we collect 
our results are almost always drawn at the top. 


The topmost layer (layer 3 in Figure 20.1) is called the output layer. 
Although we might further process the values that come out of this 
layer before using them, for example by using the softmax technique 
we saw in Chapter 17, we usually consider this layer the end of the
neural network, because it contains the final set of neurons. 


We would probably expect there to be a corresponding input layer at 
the start, and it would be natural to assume that’s the name we’d give to 
layer 1 in Figure 20.1. But that’s not how the terminology has evolved. 
There is an “input layer,” but it’s rarely shown explicitly. Rather, the 
input layer refers to the memory that holds the input values. We can
think of the row of black arrows at the bottom of Figure 20.1 as the 
input layer. 


Layer 1 and Layer 2 in Figure 20.1 are called hidden layers. If we 
imagine someone looking at this network from the outside from above 
or below, they’d see only the input layer or the output layer. We imag- 
ine that the layers in-between are “hidden” from view, and thus are 
called “hidden layers” (they could be seen from the sides, but we over- 
look that for this bit of terminology). 


Sometimes the stack is drawn left to right, as in Figure 20.2. 







Figure 20.2: The same deep network of Figure 20.1, but drawn with data 
flowing left-to-right. 


Even when drawn this way, we still use terms that refer to the vertical 
orientation. Authors might say that Layer 2 is “above” Layer 1, and 
“below” Layer 3. We can always keep things straight regardless of how
the diagram is drawn if we think of “above” or “higher” referring to a 
layer closer to the outputs, and “below” or “lower” meaning closer to 
the inputs. 


20.2.1 Tensors


Although the mechanics of deep learning networks are fundamentally 
about the manipulation of numbers, an important conceptual organi- 
zation of data is the list of numbers. This list might be one-dimensional, 
as in Figure 20.3(a), simply identifying one number after another. 










Figure 20.3: Three tensors, each with 12 elements. (a) A 1D tensor is a list. 
(b) A 2D tensor is a grid. (c) A 3D tensor is a volume. In all cases, and in 
higher-dimensional cases as well, the entire structure is filled in. That is, 
all rows, columns, etc. have the same length. 


But we can organize our data into other shapes. For instance, a two-di- 
mensional list, as in Figure 20.3(b), is perfect for storing the pixel 
values in an image. We can call this a grid or matrix. A three-dimen- 
sional list, as in Figure 20.3(c), can store volumetric data, or perhaps 
samples that each are made up of multiple features measured at mul- 
tiple times. We can call this a volume or a block. 


To simplify discussions, we refer to a list of any size and number of 
dimensions as a tensor (pronounced ten’-sir). The word “tensor” has 
a more complex meaning in some fields of math and physics. Here we 
use it just to mean a collection of numbers organized into a multidi- 
mensional list. 


So we often refer to the “input tensor” (meaning all the input values), 
the “output tensor” (meaning all the output values), and other tensors 
that are internal to the network as it computes new representations of 
the input data. 


We say that every tensor has a number of dimensions, and a size in 
each dimension. Taken together, these provide the shape of the tensor. 
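

To make the idea of shape concrete, here is a minimal Python sketch that treats nested lists as tensors. The function name tensor_shape and the three example tensors are our own illustrations, not part of any library. Each example holds 12 elements, like the tensors in Figure 20.3, though the particular sizes we chose (such as 2 by 3 by 2 for the volume) are assumptions.

```python
def tensor_shape(tensor):
    """Return the size of a nested list along each dimension, as a tuple.

    Hypothetical helper for illustration. It raises ValueError if the
    structure is not fully filled in (for example, rows of unequal length).
    """
    if not isinstance(tensor, list):
        return ()  # a single number has no dimensions
    # Every entry at this level must itself have the same shape.
    inner_shapes = {tensor_shape(item) for item in tensor}
    if len(inner_shapes) > 1:
        raise ValueError("not a tensor: entries differ in shape")
    inner = inner_shapes.pop() if inner_shapes else ()
    return (len(tensor),) + inner


# Three tensors with 12 elements each (sizes chosen for illustration).
list_1d = list(range(12))                                        # a list
grid_2d = [[0] * 4 for _ in range(3)]                            # a grid
volume_3d = [[[0] * 2 for _ in range(3)] for _ in range(2)]      # a volume

print(tensor_shape(list_1d))    # number of dimensions: 1
print(tensor_shape(grid_2d))    # number of dimensions: 2
print(tensor_shape(volume_3d))  # number of dimensions: 3
```

The length of the returned tuple is the number of dimensions, and each entry is the size in that dimension. Libraries such as NumPy report the same information through an array's shape attribute.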




20.3 Input and Output Layers 


Most networks have a single input layer and a single output layer. 
These labels simply refer to the position of the layer in the stack: the 
input is at the start (the bottom, or left) and the output is at the end 
(the top, or right). 


As we discussed earlier, the input layer is not a layer of neurons. It’s 
just a conceptual placeholder for the input data. The input layer is 
usually created and maintained for us automatically by deep learning 
libraries, and we rarely deal with it directly. We need to be aware of it, 
though, because sometimes we want to process our inputs before the 
rest of the network gets to see them, so we place some kind of process- 
ing step between the input layer and the first layer of neurons in the 
network. 


In contrast, the output layer does contain neurons, and we create it 
explicitly when we build our network. Its type and structure are entirely 
up to us. There’s often no formal definition for the output layer. Rather, 
whatever layer we place at the top of our stack is the one we call the 
“output layer” for that architecture. 


20.3.1 Input Layer 


The input layer is usually not shown in deep-learning architecture 
diagrams. It is merely some memory that holds the input values. Note 
that these are not neurons, since the input layer does no processing. It 
can be thought of simply as a collection of chunks of memory, each
capable of holding one number of the input. Figure 20.4 shows the 
idea. 





Figure 20.4: The input layer is just a placeholder where we can tempo- 
rarily store the input data. 


Some authors casually use the term “input layer” to mean the first layer 
of processing in a network, so we need to stay alert when we encounter 
terms like “first layer” and “input layer” to be sure in just what sense 
they are being used. 


20.3.2 Output Layer 


The output layer is where the network’s results are communicated to 
the world outside of the network. 


When we build our architecture, we choose the number of neurons in 
the output layer to match the type of problem we're trying to solve. 


If it’s a regression problem with just a single numerical output, then 
there will be just one neuron in the output layer, and that neuron’s 
value is our prediction. 


If we’re building a binary classifier, then we have a choice. We can use 
just one neuron with output values from, say, 0 to 1. Then values near 
0 mean the input is from one class, while values near 1 mean the input
is of the other class. Alternatively, we can have two output neurons, 
one for each category. We'll typically find which neuron has the larger 
value, and assign the corresponding category to the input. 
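

The two choices for a binary classifier can be sketched in a few lines of Python. The function names here are hypothetical, and the cutoff of 0.5 for the single-neuron version is our assumption (the text says only that values near 0 mean one class and values near 1 mean the other).

```python
def classify_single_output(value, threshold=0.5):
    """One output neuron with values from 0 to 1.

    Values below the threshold are assigned class 0, the rest class 1.
    The default threshold of 0.5 is an assumption for this sketch.
    """
    return 1 if value >= threshold else 0


def classify_two_outputs(values):
    """Two output neurons, one per class: pick the one with the larger value."""
    first, second = values
    return 0 if first >= second else 1


print(classify_single_output(0.1))       # a value near 0
print(classify_single_output(0.9))       # a value near 1
print(classify_two_outputs([0.2, 1.7]))  # the second neuron is larger
```

Either way we end up with one of two labels. The two-neuron version has the advantage that it extends naturally to more than two classes.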


A multi-class classifier will typically have as many outputs as there are 
classes. For instance, suppose we're trying to recognize lower-case let- 
ters of the Roman alphabet. We would then have 26 output neurons, 
one for each letter, providing us with a score for each letter. We can 
choose the output with the highest score as the best choice for the category
of the input. Figure 20.5 shows the idea. If we want to interpret
these outputs as probabilities, we can pass them through a softmax 
step, as discussed in Chapter 17. 
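

Choosing the output with the highest score takes only a line or two of code. In this sketch the function name best_letter and all of the scores are made up for illustration. We've given the letter "r" the largest score, as in Figure 20.5.

```python
import string


def best_letter(scores):
    """Return the lower-case letter whose output neuron has the highest score."""
    if len(scores) != 26:
        raise ValueError("expected one score per letter")
    best_index = max(range(26), key=lambda i: scores[i])
    return string.ascii_lowercase[best_index]


# Hypothetical scores: mostly small, with a few larger ones near "r".
scores = [0.1] * 26
scores[string.ascii_lowercase.index("q")] = 0.2
scores[string.ascii_lowercase.index("r")] = 7.8
scores[string.ascii_lowercase.index("s")] = 0.5

print(best_letter(scores))
```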


Figure 20.5: If we're categorizing individual letters, we might have 26
outputs, each one giving us a score for that letter. Here, the letter “R” has 
the largest score shown. 


20.4 Deep Learning Layer Survey 


Most libraries offer a wide variety of layer types. This section looks 
at some of the most common and useful ones. Because the nature of 
each library can influence which layers it provides and how they work, 
it will be easiest to focus on one library as an example. We'll base our 
discussion on Keras [Keras16], because it offers a nice selection, and
we'll be covering it in more detail in Chapters 23 and 24. Even within 
that library, our survey will not be exhaustive. 


We'll summarize here just the basic structure and function of each 
layer. Most layers have optional parameters which can be used to tune 
their behavior if we don’t like the defaults. 


One option that is available on many processing layers is the choice of 
activation function that should be applied to the output of the neurons. 
Recall from Chapter 17 that activation functions are the small non-lin- 
ear transformations that we apply to the output of each neuron before 
that value is passed on. Although in theory one could apply a different
activation function to each neuron in the layer, that is rare. In prac- 
tice we usually apply the same activation function to every neuron in a 
given layer. 


The following survey is deliberately brief. We'll return to some of these 
layers with entire chapters dedicated to their principles and uses. The 
others we'll discuss in more detail as we use them. 


20.4.1 Fully-Connected Layer 


A fully-connected layer (also called an FC or dense layer) is a set 
of neurons that each receive an input from every neuron on the previ- 
ous layer. For example, if there are 4 neurons in the dense layer, and 
4 neurons in the preceding layer, then each neuron in this layer will have
4 inputs, one from each neuron in the preceding layer, for a total of
4×4=16 connections.
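

As a rough sketch of what a fully-connected layer computes, here is a plain-Python version with 4 inputs and 4 neurons. The names dense_layer and connection_count, the weights, and the use of a bias term on each neuron are our assumptions for illustration. Real libraries do the same work with optimized matrix operations.

```python
def dense_layer(inputs, weights, biases):
    """Compute the outputs of a fully-connected layer (before any activation).

    weights[j][i] is the weight on the connection from input i to neuron j,
    so every neuron has one weight for every input.
    """
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        total = bias
        for value, weight in zip(inputs, neuron_weights):
            total += value * weight
        outputs.append(total)
    return outputs


def connection_count(weights):
    """Count the connections: one per weight."""
    return sum(len(neuron_weights) for neuron_weights in weights)


# Hypothetical values: 4 inputs feeding 4 neurons.
inputs = [1.0, 2.0, 3.0, 4.0]
weights = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
]
biases = [0.0, 0.0, 0.0, 0.5]

print(dense_layer(inputs, weights, biases))
print(connection_count(weights))  # 4 neurons times 4 inputs
```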


Figure 20.6(a) shows a diagram of a fully-connected layer with 3 neu- 
rons, coming after a layer with 4 neurons. 

























Figure 20.6: A fully-connected layer. (a) The colored neurons make up a 
fully connected layer. Each of the neurons in the upper layer receives an 
input from every neuron in the previous layer. (b) Our schematic symbol 
for a fully-connected layer. 


Figure 20.6(b) shows a schematic shorthand that we'll use for dense
layers. The idea is that there are two neurons at the top and bottom of 
the symbol, and the lines are the four connections between them. Next 
to the symbol we identify how many neurons are in the layer. When it’s 
relevant, this is also where we identify that layer’s activation function. 


Dense layers are used in many places in deep learning. Some networks, 
called fully-connected networks, or multi-layer perceptrons 
(or MLPs), are made up of nothing but a stack of dense layers. 
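

A stack of dense layers can be sketched by feeding each layer's outputs into the next. This example has 4 inputs and 3 outputs, like the network of Figure 20.1, but the hidden layer sizes (5 and 4), the constant weights, and all of the function names are assumptions made only so the sketch runs. A real network would start with random weights and learn better ones.

```python
def relu(x):
    """The ReLU activation function from Chapter 17."""
    return max(0.0, x)


def identity(x):
    """No activation: pass the value through unchanged."""
    return x


def dense(inputs, weights, biases, activation):
    """One fully-connected layer followed by its activation function."""
    return [
        activation(sum(w * v for w, v in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def constant_layer(n_inputs, n_neurons, weight, activation):
    """Build a layer whose weights all share one value (for illustration only)."""
    weights = [[weight] * n_inputs for _ in range(n_neurons)]
    biases = [0.0] * n_neurons
    return weights, biases, activation


def run_mlp(inputs, layers):
    """Pass the inputs through each layer in turn, bottom to top."""
    values = inputs
    for weights, biases, activation in layers:
        values = dense(values, weights, biases, activation)
    return values


layers = [
    constant_layer(4, 5, 0.1, relu),      # layer 1 (hidden)
    constant_layer(5, 4, 0.1, relu),      # layer 2 (hidden)
    constant_layer(4, 3, 0.1, identity),  # layer 3 (output)
]
outputs = run_mlp([1.0, 2.0, 3.0, 4.0], layers)
print(len(outputs))  # one value per output neuron
```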


20.4.2 Activation Functions 


Many layers allow us to specify an activation function that should be 
used for all neurons on that layer. The choice often comes from a list 
that covers many of the activation functions we saw in Chapter 17, such
as ReLU, sigmoid, and tanh. 


But we can choose instead to create our neurons with no activation 
functions, and then place an activation function “layer” after them. 
The neurons themselves will have no activation function on them, but 
then as their outputs flow into the activation layer, the activation func- 
tions get applied. This has the same result as identifying the activation 
function at the time of creating the layer, but it lets us break up the two 
steps if that’s a more convenient way to think about what we’re doing. 
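

Here is a small sketch of why the two approaches give the same result. The function names and all of the numbers are hypothetical. We compute a dense layer with a built-in ReLU, and then the same dense layer with no activation followed by a separate activation "layer."

```python
def relu(x):
    """The ReLU activation function from Chapter 17."""
    return max(0.0, x)


def dense(inputs, weights, biases, activation=None):
    """A fully-connected layer, with an optional built-in activation function."""
    sums = [
        sum(w * v for w, v in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]
    if activation is None:
        return sums
    return [activation(s) for s in sums]


def activation_layer(values, activation):
    """A separate "layer" that only applies an activation function."""
    return [activation(v) for v in values]


# Hypothetical inputs and weights: 3 inputs feeding 2 neurons.
inputs = [1.0, -2.0, 0.5]
weights = [[0.5, 1.0, -1.0], [1.0, 0.25, 2.0]]
biases = [0.0, 0.1]

combined = dense(inputs, weights, biases, activation=relu)
separate = activation_layer(dense(inputs, weights, biases), relu)
print(combined == separate)
```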


When working with classification problems, we'll often end with a 
dense layer that has as many outputs as there are categories. Then we 
follow that with a softmax layer, which we saw in Chapter 17. This 
does some scaling to our outputs, and makes sure that they all add up 
to 1. Together, these steps allow us to interpret the output of the soft- 
max step as probabilities. Figure 20.7 shows the idea. 


Figure 20.7: The softmax operation changes a list of numbers so that they
represent probabilities. Using a mathematical transformation, the result
is that every output is from 0 to 1, and all the outputs taken together sum
to 1. In this example, we have five values coming out of a fully-connected,
or dense, layer. Their values are between 0.01 and 1.5, and they sum up to
about 3.8. After the softmax, the values are from 0 to 1 and add up to 1.
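

For readers who'd like to see the softmax step in action, here is a minimal sketch using the standard definition: exponentiate each value, then divide by the sum of the exponentials. The five input scores are hypothetical. We picked them to lie between 0.01 and 1.5 and to sum to about 3.8, like the example in Figure 20.7, but they are not the figure's actual values.

```python
import math


def softmax(scores):
    """Turn a list of scores into values from 0 to 1 that add up to 1."""
    # Subtracting the largest score first avoids overflow in exp()
    # and does not change the result.
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [0.01, 0.5, 0.8, 1.0, 1.5]  # hypothetical dense-layer outputs
probs = softmax(scores)

print(sum(scores))  # about 3.8
print(sum(probs))   # 1, up to rounding
```

The largest score still gets the largest probability, so choosing the best category gives the same answer before and after the softmax.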


20.4.3 Dropout 


Overfitting is a problem for many neural networks. As soon as a net- 
work starts to memorize the training data, and is therefore overfitting, 
we typically stop training. Any technique that delays the
onset of overfitting is a type of regularization. Regularization methods
are great because they allow us to train our networks for longer
before they overfit, giving us better performance. 


One such technique for delaying overfitting is called dropout, and it 
can be used in a deep network with the inclusion of a dropout layer 
[Srivastava14]. The dropout layer doesn’t contain any neurons. Unlike
the softmax layer, the dropout layer doesn’t even do any computing. 
Instead, it just temporarily disconnects some of the neurons on the
previous layer. This layer is only active during training. When we use 
the network for predicting, the dropout layers have no effect. 


The dropout layer takes a parameter that describes the percentage of 
neurons that should be affected. At the start of each epoch, we ran- 
domly choose that percentage of neurons on the preceding layer, and 
temporarily disconnect their inputs and outputs from the other neu- 
rons. Effectively, these neurons are left stranded, each one an isolated 
island. Since they’re disconnected, these neurons don’t participate in
any calculations or updates. When the epoch is done and the weights 
have been updated, the neurons and all of their connections are 
restored. 


For example, suppose the dropout percentage is 20% (a common 
value), and the preceding layer has 100 neurons. Then at the start of 
each epoch we pick 20 neurons at random (since 20% of 100 is 20), 
and they are temporarily disconnected from the network. 


When the epoch is over, we restore those neurons and their connec- 
tions, just as if they'd never been disconnected. Then at the start of 
the next epoch we choose a new random set of neurons and temporarily
remove those, repeating the process for each epoch. Figure 20.8 shows 
the idea graphically. 


Figure 20.8: How dropout works. Here we're applying dropout to the 
middle layer. (a) Here we're not drawing the dropout layer explicitly, but 
just showing its effect. We've set the dropout percentage to 0.5, so 50%
of the 4 neurons in the middle layer are chosen to be disconnected before 
the epoch begins. They’re shown here in gray. In effect, they’re not part 
of the network. When the epoch is over, all of their input and output 
connections are restored. (b) Our schematic for a single dropout layer is 
a diagonal slash from lower-left to upper-right. To the right we indicate 
the proportion of neurons that are selected for disconnection. 
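

The selection step can be sketched with Python's random module. This is a simplified illustration under several assumptions, not how any particular library implements dropout. The function names are hypothetical, and we model "disconnecting" a neuron by replacing its output with 0 so that it contributes nothing to later layers. The sketch leaves it to the caller to decide how often to draw a new random set. The text above describes doing so once per epoch, while many libraries draw a fresh set for every batch and also rescale the surviving outputs, which we omit here.

```python
import random


def choose_dropped(n_neurons, rate, rng):
    """Randomly pick which neurons to disconnect, given a dropout rate."""
    n_dropped = round(n_neurons * rate)
    return set(rng.sample(range(n_neurons), n_dropped))


def apply_dropout(outputs, dropped, training=True):
    """Zero the outputs of the dropped neurons, but only during training."""
    if not training:
        return list(outputs)  # dropout has no effect when predicting
    return [0.0 if i in dropped else v for i, v in enumerate(outputs)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
outputs = [1.0] * 100   # hypothetical outputs of a 100-neuron layer

dropped = choose_dropped(100, 0.2, rng)  # 20% of 100 neurons
during_training = apply_dropout(outputs, dropped, training=True)
during_prediction = apply_dropout(outputs, dropped, training=False)

print(len(dropped))          # how many neurons were disconnected
print(sum(during_training))  # the remaining neurons still contribute
```

Calling choose_dropped again gives a different random set, which corresponds to restoring the neurons and then disconnecting a new group.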


The intention behind dropout is to prevent any of our neurons from 
over-specializing. Suppose that one neuron in a photo-classification 
system gets highly specialized to detect, say, the eyes of cats. That’s 
useful for recognizing pictures of cats’ faces, but useless for all the other
photographs the system might be asked to classify. 


This kind of specialization easily leads to overfitting. If the various neu- 
rons in a network all get really good at finding just one or two features 
in the training data, then they can perform beautifully on that data,
because they spot the idiosyncratic details that they’re trained to locate. 
But the system as a whole will do badly when presented with new data 
that’s missing the precise cues those neurons became specialized for. 


To keep this kind of specialization from leading to overfitting, we want 
to avoid the specialization in the first place. That’s what dropout does 
for us. 


When we remove neurons at random, sometimes a specialized neuron will 
be chosen. This means that the neurons that remain are forced to step 
in and take on some of the responsibility of that lost neuron. When 
the specialized neuron is reconnected, its specialized response is no 
longer needed as much, and so it, too, is free to become more general. 
Both of these steps lead to neurons that are more generalized in their 
responses, and thus less prone to overfitting. 


In other words, dropout helps put off overfitting by spreading around 
the learning among all the neurons. 


A dropout layer can follow any layer with neurons. 
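

To make the bookkeeping concrete, here is a minimal Python sketch of the selection step described above. The function names choose_dropped and apply_dropout are our own inventions for illustration, and we model "disconnecting" a neuron simply by replacing its output with 0. Libraries differ in the details (for instance, many make an independent random choice for each neuron and rescale the surviving values), so treat this only as a picture of the idea.

```python
import random

def choose_dropped(num_neurons, rate, rng):
    """Pick the neurons to disconnect: rate * num_neurons of them, at random."""
    num_dropped = round(rate * num_neurons)
    return set(rng.sample(range(num_neurons), num_dropped))

def apply_dropout(outputs, dropped):
    """Replace the output of every dropped neuron with 0; keep the rest."""
    return [0.0 if i in dropped else value for i, value in enumerate(outputs)]

rng = random.Random(0)                        # seeded only for repeatability
outputs = [float(i + 1) for i in range(100)]  # made-up outputs of 100 neurons

for epoch in range(3):
    dropped = choose_dropped(len(outputs), 0.2, rng)  # a fresh set each time
    thinned = apply_dropout(outputs, dropped)
    print(epoch, len(dropped), sorted(dropped)[:5])
```

With 100 neurons and a rate of 0.2, each pass selects 20 neurons, matching the example in the text.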


20.4.4 Batch Normalization 


Another regularization technique is called batch normalization, 
often referred to simply as batchnorm [Ioffe15]. Like dropout, 
batchnorm can be implemented as a layer that we include in our network, 
but this layer also doesn’t contain neurons. Unlike dropout, batchnorm 
actually does perform some computation, though there are no 
parameters and nothing for us to control. 


Batchnorm is used to modify the values that come out of a computa- 
tional layer, such as a fully-connected layer, or one of the layers we'll 
see below. This might seem strange, since the whole purpose of our 
layers is to learn to produce output values that will lead to good results. 
Why would we want to modify those outputs? 


It turns out that some modifications to the outputs of a layer can make 
those numbers a better fit with the computations yet to come. For 
instance, suppose we were to divide all the values coming out of a layer 
by some fixed number, say 2. Then every value that goes on to the 
next layer would be half of what it would otherwise be. Because all the 
values are being divided by 2, it changes their absolute sizes, but not 
their relative sizes. So a value that is 3 times greater than some other 
value will still be 3 times greater. 


When we work through the math, this change in scale doesn’t change 
the relative outputs. For instance, if we’re doing classification, the 
neuron with the largest score in a network that doesn’t do this division 
will still have the largest score after we include this scaling operation, 
so we'll still identify the same category for the input. 
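

A tiny Python check can illustrate the claim. The three scores below are made up, and index_of_largest is a hypothetical helper of our own. The sketch assumes we divide by a positive number, as in the example above.

```python
def index_of_largest(values):
    """Return the position of the largest value in the list."""
    return max(range(len(values)), key=lambda i: values[i])

scores = [1.5, 6.0, 2.0]          # made-up outputs of three neurons
halved = [v / 2 for v in scores]  # the same outputs after dividing by 2

# The winning neuron is the same before and after scaling.
print(index_of_largest(scores), index_of_largest(halved))

# The ratio between two outputs is also unchanged.
print(scores[1] / scores[2], halved[1] / halved[2])
```

The absolute sizes shrink, but the neuron with the largest score stays the largest, and a value that was 3 times another is still 3 times that value.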


So what’s the point of such a scaling? It’s to keep the values flowing 
through the network from growing too large. Recall that in our dis- 
cussion of regularization in Chapter 9, we saw that keeping the values 
small helps to put off overfitting. 


The technique that batchnorm uses to scale the values is similar to 
ideas we saw in Chapter 12. There, we saw that when we prepare data 
for machine learning, we often want to normalize it. That is, we move 
and scale the values so that they have an average of 0 and a standard 
deviation of 1. 


The general intuition of batchnorm is that if it’s a good idea to normal- 
ize data for the input layers (and it is), then it would be a good idea to 
normalize data for internal layers as well. So batchnorm does just that, 
moving and scaling the data coming out of one layer so that it has an 
average of 0 and a standard deviation of 1, giving it just the right prop- 
erties so the next layer can handle it easily and efficiently. 


That explains the “norm” in “batchnorm.” The “batch” part comes 
about because we apply this step after collecting up an entire batch’s 
worth of output from the layer we’re following. As we discussed in 
Chapter 8, in practice, this “batch” almost always refers to a mini- 
batch, far smaller than the entire training set. 


We use the technique by placing a batchnorm layer after a computational 
layer but before its activation function. So we first create our 
computational layer with no activation function, follow that with a 
batchnorm layer, and then follow that with the activation function 
layer, as in Figure 20.9. 


[Figure 20.9 panel labels: (a) fully-connected layer with ReLU activation function; (b) fully-connected layer with no activation function, and a separate ReLU activation function; (c) schematic version, labeled “No AF” and “ReLU”.] 


Figure 20.9: How to apply a batchnorm layer. (a) The “before” picture, 
showing a fully-connected layer with a ReLU activation function that 
we’d like to regularize with batchnorm. (b) The “after” picture. We move 
the activation function into its own layer. We then follow the 
fully-connected layer with a batchnorm layer, and then 
follow that with the ReLU activation we had before. (c) Our schematic 
version of the network in (b). We begin with a fully-connected layer with 
no activation function, follow it with the batchnorm icon, and then a 
ReLU activation function layer. The batchnorm icon suggests that the 
data, represented by the black circle, is normalized, because it’s centered 
in the larger circle and nicely sized there. 


So a batchnorm layer collects together all the values that flow out of 
a layer over the course of a batch. Then it normalizes those values, so 
they have a mean (or average) of 0 and a standard deviation of 1. 
This prevents the values coming out of the layer from drifting into very 
positive or very negative regions, or spreading out (or compressing) 
too much. The action helps keep the values closer to the range where 
the activation function has its most useful non-linearities. 


The upshot of this is that it defers the onset of overfitting, allowing us 
to train longer. 
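

The normalizing step itself can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function name normalize_batch is hypothetical, the values stand for one neuron's outputs over a made-up mini-batch of 5 samples, and the small epsilon is added by us to avoid dividing by zero when all the values are equal. Library layers add further details that we leave out here.

```python
import math

def normalize_batch(values, epsilon=1e-5):
    """Move and scale values so they have mean 0 and standard deviation 1."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance + epsilon)  # epsilon guards against dividing by 0
    return [(v - mean) / std for v in values]

# Made-up outputs of one neuron, collected over a mini-batch of 5 samples.
batch_outputs = [12.0, 15.0, 9.0, 30.0, 24.0]
normalized = normalize_batch(batch_outputs)

new_mean = sum(normalized) / len(normalized)
new_std = math.sqrt(sum((v - new_mean) ** 2 for v in normalized) / len(normalized))
print(normalized)
print(new_mean, new_std)  # very close to 0 and 1
```

Note that the order of the values is preserved: the largest output before normalization is still the largest afterwards.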


20.4.5 Convolution 


A convolution layer is most famously used for processing 2D images. 
For example, neural networks based on convolution are used to iden- 
tify faces in photographs. Convolution is also useful for 1D sequences, 
3D volumes, and even higher-dimensional data. We’ll look at convolu- 
tion more closely in Chapter 21. 


For the moment, let’s just get a general sense of the big picture. We'll 
consider convolution on a 2D image. In addition to this input image, 
we'll also create a second, tiny image, perhaps as small as 3 by 3 pixels. 
We'll call this little image a filter. 


Now we can move that tiny square filter over the entire input image. 
For every pixel in the input, we'll center our 3 by 3 filter over it, and 
we'll multiply together the value of each pixel in the 3 by 3 filter with 
the value of the pixel in the input image below it. We'll add up those 
values, and that becomes the value of that pixel in an output image. 


If we have two filters, then we’ll produce two output images, one for 
each filter. We could have 3 filters or 300, and each one will follow the 
same process and produce an output image. 


The intuition behind this process is that each filter is “looking for” a 
specific feature in the image, like a tiger’s stripes or a person’s birth- 
mark. Because the filters are moving over the entire image, they can 
find the elements they’re looking for anywhere in the image. If we use 
multiple convolution layers, one after the other, they can work 
hierarchically, with each layer using the results of the previous layers to help 
it look for larger patterns. 


Figure 20.10 shows an example of a 2D convolution layer in action for 
a 5 by 4 image and two different 3 by 3 filters. We place the center of 
each 3 by 3 filter over each pixel in the 5 by 4 image in turn, multiply 
each pixel in the filter by the value in the image under it, and add up 
all those results to get the value for that pixel in the output for that fil- 
ter. For now, assume that if any pixels in the filter fall off the sides of 
the input image, we'll just use a 0 for the missing input image value. 
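

The procedure just described can be written out as a short Python sketch. The function name convolve2d and the two example filters are our own choices for illustration, and we read "5 by 4" as 5 pixels wide and 4 pixels tall. Real libraries implement this far more efficiently and with more options.

```python
def convolve2d(image, filt):
    """Center filt over every pixel of image; off-image pixels count as 0."""
    rows, cols = len(image), len(image[0])
    f_rows, f_cols = len(filt), len(filt[0])
    output = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total = 0.0
            for fr in range(f_rows):
                for fc in range(f_cols):
                    # Image pixel that sits under this filter element.
                    ir = r + fr - f_rows // 2
                    ic = c + fc - f_cols // 2
                    if 0 <= ir < rows and 0 <= ic < cols:
                        total += filt[fr][fc] * image[ir][ic]
                    # Otherwise the filter hangs off the edge: add nothing.
            output[r][c] = total
    return output

image = [[1.0] * 5 for _ in range(4)]       # 5 wide by 4 tall, all 1's
box_filter = [[1.0] * 3 for _ in range(3)]  # adds up the 3 by 3 neighborhood
center_filter = [[0.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 0.0]]           # copies the pixel under the center

# Two filters give two output images, each the same size as the input.
outputs = [convolve2d(image, f) for f in (box_filter, center_filter)]
for out in outputs:
    for row in out:
        print(row)
```

In the first output, interior pixels get the value 9, while edge and corner pixels get smaller values because part of the filter falls off the image.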


[Figure 20.10 panels: (a) two filters applied to the input image; (b) the convolution layer symbol, labeled 2 x (3,3).] 


Figure 20.10: A convolution layer applies 1 or more smaller images to an 
input image. Here we have a 5 by 4 starting image, and two 3 by 3 filters. 
(a) We move the red 3 by 3 filter over the input image, placing the center 
of the filter over each pixel in the input. We multiply its values by the 
values of the pixels under it. Adding up the results gives us the value of 
the pixel in the red output image. We do the same process for the blue 
filter, producing the blue output image. (b) Our symbol for a convolution 
layer is a small box inside a larger one, meant to suggest the small image 
that is moved over the larger image. To the right we indicate how many of 
these smaller images are used, and their size. When needed, we can also 
indicate the activation function for the layer. 


We'll look at this process more closely in Chapter 21, which is entirely 
devoted to networks that are based around convolution. 


2D convolution is great for working with images, and it appears fre- 
quently in image-processing networks. Many libraries also offer 
convolution in other numbers of dimensions, such as 1D and 3D. 


20.4.6 Pooling Layers 


A pooling layer lets us change the size of the data flowing through 
a network. This process is often used with images, when we want to 
reduce the size of the image so we can process it more quickly. 


Let’s suppose we're given an input image of size 512 by 512. That’s a 
quarter of a million pixels, which is a lot of data to handle. 


It might be useful to run one layer of the network on the 512 by 512 
image, so it can work with pixel-level details. But then the next layer 
could work with a 256 by 256 version, and then the next layer a 128 by 
128 version, and so on. 


To do this, we can literally reduce the size of the image. Suppose that 
we look at the upper-left 2 by 2 block of pixels, extract the largest 
value, and write that into the upper-left of a new image. Now we move 
two pixels to the right and choose the next 2 by 2 block, and so on, as 
shown in Figure 20.11. 


Figure 20.11: A pooling layer. (a) When we apply pooling to a 2D image, 
we gather small blocks (often squares) and use some version of their data 
(usually either the average value or the largest one) as the value we place 
into a new, smaller image. (b) Our schematic symbols for pooling. The 
symbol suggests a reduction in the length of the side of an input. The two 
versions distinguish average and maximum pooling. 


The result of this would be an image with sides half the size of the orig- 
inal. If we had used 3 by 3 blocks, the sides would each have been one 
third of the originals. 


Using the largest value from each block is often a good way to produce 
a smaller image. Using the average value of the block is popular as well. 
Most libraries will offer at least these two options. Like convolution 
layers, pooling layers are also often available in multiple dimensions, 
such as 1D, 2D, and 3D. 
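

Here is a small Python sketch of both kinds of pooling on a made-up 4 by 4 image. The names pool2d and average are hypothetical helpers of our own, and we assume the image's sides are exact multiples of the block size.

```python
def pool2d(image, block, combine):
    """Shrink image by replacing each block-by-block tile with one value."""
    rows, cols = len(image) // block, len(image[0]) // block
    pooled = []
    for r in range(rows):
        row = []
        for c in range(cols):
            tile = [image[r * block + dr][c * block + dc]
                    for dr in range(block) for dc in range(block)]
            row.append(combine(tile))
        pooled.append(row)
    return pooled

def average(values):
    return sum(values) / len(values)

image = [[1, 3, 2, 0],
         [5, 4, 1, 1],
         [0, 2, 9, 8],
         [1, 1, 7, 6]]

max_pooled = pool2d(image, 2, max)      # largest value in each 2 by 2 block
avg_pooled = pool2d(image, 2, average)  # average value of each 2 by 2 block
print(max_pooled)  # [[5, 2], [2, 9]]
print(avg_pooled)  # [[3.25, 1.0], [1.0, 7.5]]
```

Both results have sides half the length of the original, as described above.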


Pooling layers are most frequently used after convolution layers, where 
we use them to make smaller and smaller versions of the input image 
using the approach we discussed above. But this approach is falling 
out of favor, because (as we'll see in Chapter 21), convolution layers 
can reduce the size of their images as they go. Letting the convolution 
layer change the size of the image is often more efficient than using a 
pooling layer, and can even give us better results and faster training. 


20.4.7 Recurrent Layers 


There are lots of interesting sequences in the world. We might have 
stock prices taken over a series of days, notes that make up a song, 
measurements of a piece of equipment, frames making up a movie, or 
words from a piece of spoken or written language. It’s natural to want 
to ask questions about all of these sequences, such as whether they’re 
like some other sequence (e.g., is this book written by the same author 
as this other book?), how they could be expressed in other terms 
(e.g., translation of a string of words into another language), or how 
they’re likely to behave in the future (e.g., what will the stock price be 
tomorrow?). 


Neural networks made of the layers we’ve seen so far can address 
these questions, with the right structure [vandenOord16]. But a tra- 
ditional problem with those networks is that they have no memory, 
which means they’re poor at using context, which is important when 
we want to answer questions about sequences. For example, if we’re 
translating a piece of speech, we might be considering just one word 
at a time. It’s context that lets us consider the words that came before, 
and maybe those that come after, so we can make the best choice of 
translation. 


We can address this lack of memory by replacing our basic artificial 
neuron with a more complex bit of processing called a recurrent 
unit or recurrent cell. We can build deep-learning networks using 
a mixture of standard layers, and layers made up of recurrent cells. If 
the recurrent cells are an important part of the network, we often call 
it a recurrent neural network, or RNN. 


RNNs can answer all of the questions we asked above, and many others. 
They’re used for activities ranging from language translation to 
automatic photo captioning, and even the generation of new prose in 
the style of known authors. 


Note that a recurrent cell is unrelated to the concept of recursion, 
which sounds very similar but is a completely different idea. Recursion 
involves a function that calls itself, often with modified arguments. 
Recurrence, on the other hand, refers to a repeating, or recurring, 
action. A recurrent cell repeats a given operation over and over, so it’s 
recurring, not recursing. 
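

The difference is easy to see in code. In the Python sketch below, factorial is a classic recursive function, while running_average is a toy of our own that repeats one step over a sequence and carries a piece of state from each step to the next. It is only a stand-in to show the pattern of recurrence, not an actual recurrent cell.

```python
def factorial(n):
    """Recursion: the function calls itself with a modified argument."""
    return 1 if n <= 1 else n * factorial(n - 1)

def running_average(sequence):
    """Recurrence: the same step repeats, carrying state along the sequence."""
    state = 0.0   # the "memory" handed from one step to the next
    outputs = []
    for count, value in enumerate(sequence, start=1):
        state = state + (value - state) / count  # same operation every step
        outputs.append(state)
    return outputs

print(factorial(5))                        # 120
print(running_average([2.0, 4.0, 6.0]))    # [2.0, 3.0, 4.0]
```

Each output of running_average depends on everything that came before it in the sequence, which is the flavor of memory and context that recurrent cells provide.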


We'll look at recurrent networks in detail in Chapter 22. For now, it’s 
enough to know that they provide us with a flexible way to add mem- 
ory and context to our networks. 


Figure 20.12 shows one of the standard ways to draw a recurrent cell, 
along with our schematic symbol for an entire layer of them. We also 
show the symbol for a recurrent cell that returns a sequence, which 
we'll discuss in Chapter 22. 




Figure 20.12: A recurrent cell. (a) A typical way to draw a recurrent cell. 
The black box represents one step of memory. (b) Our schematic for a 
layer of such cells. (c) Our schematic for a recurrent cell that returns a 
sequence. 


When we insert a recurrent layer into our network, we identify how 
much memory we want our cell to use. We'll see that we have a wide 
range of choices in how we set up and use these layers, so we'll defer a 
detailed discussion until Chapter 22. 


20.4.8 Other Utility Layers 


There are many useful ways we might want to transform our data as 
it flows through our layers. To keep a consistent “stack of layers” met- 
aphor, we can wrap up each of these transformations into a layer of 
its own, and simply add it to the stack as we build our architecture. 


Like the dropout and batchnorm layers we’ve already seen, these lay- 
ers don’t have neurons (or recurrent cells), and don’t contain weights 
that get updated. So these layers don’t learn and change over time. 
They’re just utility functions that help us shape and modify our data as 
it flows from one computational layer to the next. Calling these utility 
operations “layers” may be a stretch, but it’s so convenient to treat our 
networks as stacks of layers that it’s become a standard convention. 


In addition to dropout, batchnorm, and pooling layers, there are sev- 
eral other popular categories of utility layers. Let’s look at them. 


A normalization layer attempts to modify the data as it flows 
through to regularize the network, or keep the weights low so that we 
put off the start of overfitting. The batchnorm layer we saw above is an 
example of a normalization layer. 


A noise layer adds random values to each piece of data that flows 
through it. This can help with cases of overfitting that resist other 
approaches, as it prevents the neurons from getting too proficient at 
always responding the same way to the same input data. 


A reshaping layer lets us change the shape of the tensor that’s 
flowing through it. For example, we might have a 3D input tensor of 
shape 10 by 3 by 5, for a total of 150 elements. We might want to mush 
together the last two dimensions to make a 2D grid that we can treat 
as an image. We can use a reshaping layer to declare that it should now 
be interpreted as a tensor of size 10 by 15. Remember that this whole 
idea of “shape” is something that guides how each layer interprets the 
input data. The low-level details of how elements get handled for 
differently-shaped tensors are handled automatically by each layer. 
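

As a sketch of the 10 by 3 by 5 example, here is one way to reinterpret a nested Python list as a 10 by 15 grid. The helper names flatten and reshape_2d are our own, and we assume the elements keep their order when read off with the last index changing fastest, which is a common convention but still an assumption here.

```python
def flatten(tensor):
    """Read off every element of a nested list, last index changing fastest."""
    if not isinstance(tensor, list):
        return [tensor]
    flat = []
    for item in tensor:
        flat.extend(flatten(item))
    return flat

def reshape_2d(tensor, rows, cols):
    """Reinterpret the same elements as a rows-by-cols grid."""
    flat = flatten(tensor)
    if len(flat) != rows * cols:
        raise ValueError("the new shape must hold the same number of elements")
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

# A 10 by 3 by 5 tensor whose elements are numbered 0 through 149.
tensor = [[[(i * 3 + j) * 5 + k for k in range(5)]
           for j in range(3)]
          for i in range(10)]

grid = reshape_2d(tensor, 10, 15)
print(len(grid), len(grid[0]))  # 10 15
print(grid[0])                  # the first 15 elements, 0 through 14
```

No element is added, removed, or changed. Only the way we index them is different.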


A cropping layer is particularly useful when working with images. 
It simply extracts a given rectangle of the image and throws away the 
rest (equivalently, we can say that it throws away some of the border 
and keeps the interior). 


A zero-padding layer is often used with convolution and 2D images. 
It places a ring of 0’s around the outside of the image. The ring can 
be as thick as we like. In 3D, it places shells of 0’s around the starting 
volume. 


An upsampling layer is like a pooling layer, but it works in reverse. 
It makes the input tensor bigger, not smaller. This is typically done by 
simply repeating elements. 
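

The cropping, zero-padding, and upsampling operations are simple enough to sketch on a tiny 2 by 2 image. The function names and their arguments below are our own hypothetical choices, not any library's API.

```python
def zero_pad(image, thickness):
    """Surround a 2D image with a ring of 0's of the given thickness."""
    width = len(image[0]) + 2 * thickness
    top = [[0] * width for _ in range(thickness)]
    bottom = [[0] * width for _ in range(thickness)]
    middle = [[0] * thickness + list(row) + [0] * thickness for row in image]
    return top + middle + bottom

def crop(image, top, left, height, width):
    """Keep only the given rectangle and throw away the rest."""
    return [row[left:left + width] for row in image[top:top + height]]

def upsample(image, factor):
    """Make the image bigger by repeating each element factor times each way."""
    bigger = []
    for row in image:
        wide = [value for value in row for _ in range(factor)]
        bigger.extend(list(wide) for _ in range(factor))
    return bigger

image = [[1, 2],
         [3, 4]]

padded = zero_pad(image, 1)
print(padded)     # [[0, 0, 0, 0], [0, 1, 2, 0], [0, 3, 4, 0], [0, 0, 0, 0]]

cropped = crop(padded, 1, 1, 2, 2)
print(cropped)    # [[1, 2], [3, 4]]

upsampled = upsample(image, 2)
print(upsampled)  # [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
```

Cropping the padded image back to its interior recovers the original, which shows how cropping and zero-padding undo one another.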


Finally, a flatten layer is a special form of reshaping layer that turns 
whatever tensor is coming in, of any number of dimensions, into just 
a single big one-dimensional list. This can be useful when we switch 
from one type of processing to another. For instance, suppose we have 
a 2D image, and a list of the 5 people in it. We’d like to know who’s 
in the center. We might use a series of convolution layers to analyze 
the image, but at the end we want to transform the data into a 1D list 
that we can hand off to a fully-connected layer with 5 neurons, one for 
each person. We’d follow that with a softmax layer, providing us with 
a probability for each person being the one in the middle. This conver- 
sion from 2D to 1D is a perfect fit for a flatten layer. 
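

The tail end of that pipeline might be sketched like this. Everything here is made up for illustration: the 2 by 2 "features" grid stands in for the output of the convolution layers, the weights are arbitrary numbers rather than trained values, and the function names are our own.

```python
import math

def flatten_image(image):
    """Turn a 2D grid into a single 1D list, row by row."""
    return [value for row in image for value in row]

def dense(inputs, weights, biases):
    """One fully-connected layer with no activation function."""
    return [sum(w * x for w, x in zip(neuron_weights, inputs)) + b
            for neuron_weights, b in zip(weights, biases)]

def softmax(scores):
    """Turn a list of scores into probabilities that add up to 1."""
    largest = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

features = [[0.2, 0.8],
            [0.5, 0.1]]                 # stand-in for the convolution output
flat = flatten_image(features)          # 4 values in a 1D list

# 5 neurons, one per person, each with 4 arbitrary (untrained) weights.
weights = [[0.1 * (n + 1)] * 4 for n in range(5)]
biases = [0.0] * 5

probabilities = softmax(dense(flat, weights, biases))
print(probabilities)
```

The result is a list of 5 probabilities, one per person, that sum to 1.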


20.5 Layer and Symbol Summary 


Figure 20.13 gathers together our schematic symbols for all the layers 
we've seen so far. 


[Figure 20.13 symbol labels. Top row: Dense, Convolution, Recurrent, Sequences, Flatten, Reshape. Bottom row: Dropout, Batchnorm, Noise, Zero Pad, Average Pool, Max Pool, Upsample.] 


Figure 20.13: Schematic symbols for popular layers. Parameters for each 
layer are conventionally written to a symbol’s right for vertical diagrams, 
or below it for horizontal diagrams. Upper left: The main computational 
layers. Upper right: Utility computational layers that affect the data 
flowing through them, or computations on the previous layer. Bottom left: 
Layers that reshape the existing data. Bottom right: Layers that change 
the shape of the tensor by adding, removing, or combining elements. 


The actions of some of these layers on an example tensor are shown in 
Figure 20.14. 


Figure 20.14: The actions of some of the utility layers in Figure 20.13 on 
an incoming tensor of dimensions 2 by 2 by 4, shown in the middle of 
the left column. Most layers have parameters that control their behavior. 
We've left those out to reduce clutter. 


20.6 Some Examples 


Let’s look at some examples of using our layers. Building a neural net- 
work using a modern library is often no harder than naming the layers 
we want, one after the other, with the proper parameters. We may be 
responsible for wiring each layer to its predecessor, or the library may 
take care of this for us. We'll see many practical examples of building 
networks with the Keras library in Chapters 23 and 24. 


Let’s start with a simple network of 2 fully-connected layers of 2 neu- 
rons each, with 2 inputs and 2 outputs, with ReLU activation functions 
on all neurons. This is shown in both the traditional written-out form, 
and our schematic notation, in Figure 20.15. 


[Figure 20.15 labels: two fully-connected layers, each marked “2, ReLU”, above “2 Inputs”.] 


Figure 20.15: A tiny neural network with 2 inputs, 2 outputs, and 2 
fully-connected layers of 2 neurons each. The ReLU activation function 
is used on every neuron. Left: The network using our schematic form. 
Right: The network in traditional text-and-box form. 
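

To show how little there is to such a network, here is a sketch of its forward pass in plain Python. The weights and biases are arbitrary numbers we picked for illustration (a real network would learn them), and the function names are our own.

```python
def relu(x):
    return max(0.0, x)

def dense_relu(inputs, weights, biases):
    """A fully-connected layer whose neurons all use ReLU."""
    return [relu(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
            for neuron_weights, b in zip(weights, biases)]

# Arbitrary weights: one row of 2 weights for each of the 2 neurons in a layer.
layer1_weights = [[1.0, -1.0], [0.5, 0.5]]
layer1_biases = [0.0, 0.0]
layer2_weights = [[1.0, 1.0], [-1.0, 2.0]]
layer2_biases = [0.0, -1.0]

def tiny_network(inputs):
    """2 inputs, then 2 fully-connected layers of 2 neurons each, 2 outputs."""
    hidden = dense_relu(inputs, layer1_weights, layer1_biases)
    return dense_relu(hidden, layer2_weights, layer2_biases)

outputs = tiny_network([2.0, 1.0])
print(outputs)  # [2.5, 1.0]
```

Building the network really is just a matter of naming the layers one after the other. The hard part, finding good weights, is the job of training.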


Let’s look at a larger network with 4 fully-connected layers, of 2, 4, 3, 
and 2 neurons. This is shown in Figure 20.16. 


Figure 20.16: A 4-layer deep network built from fully-connected layers, 
in both schematic and box-and-text forms. 


We can build all kinds of little networks, but let’s make a big jump to 
something much bigger and more interesting. This network will have 
16 layers plus some utility layers for zero-padding, pooling, flattening, 
and applying dropout. 16 computational layers was once a lot, but by 
today’s standards it might be considered a small network. This net- 
work was built as an entry in a competition. 


The ILSVRC2014 competition was a set of public challenges in 2014, 
one of which asked people to build classifiers [Russakovsky15]. The 
acronym ILSVRC stands for “ImageNet Large Scale Visual Recognition 
Challenge.” The contest organizers assembled a huge database of pic- 
tures of objects, and manually assigned a label to each one, identifying 
the most prominent object in the picture. They provided a big piece of 
this data to entrants so they could use it to train their networks. The 
organizers then tested each submission against a different set of test 
data, and published the results for each team’s entry [Imagenet14]. 


A big surprise was that one of the best-performing networks was mostly 
a chain of 13 convolution layers, followed by 3 dense layers (along 
with some utility layers). The network was submitted by the “Visual 
Geometry Group,” who named it VGG16 for its 16 computational layers 
[Simonyan14]. 


Figure 20.17 shows the VGG16 network using our schematic notation. 
The network is largely built from “blocks” of a zero-padding layer fol- 
lowed by a convolution layer. Each block is repeated 2 or 3 times. 


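[Figure 20.17 layer labels, in order from the input: 
zero-pad 1 and convolution 64 x (3x3) with ReLU, 2 times; pool 2x2, stride 2x2 
zero-pad 1 and convolution 128 x (3x3) with ReLU, 2 times; pool 2x2, stride 2x2 
zero-pad 1 and convolution 256 x (3x3) with ReLU, 3 times; pool 2x2, stride 2x2 
zero-pad 1 and convolution 512 x (3x3) with ReLU; pool 2x2, stride 2x2 
zero-pad 1 and convolution 512 x (3x3) with ReLU; pool 2x2, stride 2x2 
dense 4096 with ReLU; dropout 0.5; dense 4096 with ReLU; dropout 0.5; dense 1000 with softmax] 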











Figure 20.17: The VGG16 neural network for image classification. The 
network is one long stack, split here into two rows for clarity. 


The network contains 5 blocks of a zero-padding layer followed by 
a convolution layer, with each of those blocks repeated 2 or 3 times. 
Between groups of blocks there’s a pooling layer that reduces the size 
of the data by one-half in both width and height. At the end, the data is 
flattened, and then passed through a couple of dense layers with drop- 
out. The final layer is a dense layer with softmax, which produces a 
probability for each of the 1000 possible labels to be assigned to the 
input image. 
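

We can tally up this description with a few lines of Python. Two things here are assumptions on our part rather than statements from this section. First, we take the last two groups to repeat 3 times each, which is what the total of 13 convolution layers implies. Second, we use an input image 224 pixels on a side, the size we believe the original VGG16 used. We also assume that the padded 3 by 3 convolutions leave the image size unchanged, so only the pooling layers shrink it.

```python
# (number of filters, times the zero-pad + convolution block repeats)
conv_groups = [(64, 2), (128, 2), (256, 3), (512, 3), (512, 3)]
dense_sizes = [4096, 4096, 1000]

conv_layers = sum(repeats for _, repeats in conv_groups)
computational_layers = conv_layers + len(dense_sizes)
print(conv_layers, computational_layers)  # 13 16

side = 224  # assumed width and height of the input image
for filters, repeats in conv_groups:
    side //= 2  # the pooling layer after each group halves width and height
    print(f"after the {filters}-filter group: {side} by {side}")
```

Under these assumptions the image shrinks from 224 by 224 to 7 by 7 before it is flattened and handed to the dense layers.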


To use this network, we plug a picture into the input, and it produces 
1000 numbers, giving us the probability that the picture belongs to 
each of the 1000 possible labels. In the ILSVRC 2014 competition, 
VGG16 scored a 7.3% top-5 error (that is, the right answer was among its 
five highest-scoring labels 92.7% of the time). 


We won't get into the details of the VGG16 architecture here, leaving 
those for Chapter 21 when we look closely at convolutional layers 
and networks built from them. At that point we'll also discuss all the 
parameter values that appear in Figure 20.17, which we’ve included 
here for completeness. 


For fun, let’s look at the performance of this system. 


We'll ask VGG16 to evaluate some images. These are not images that 
the network has ever seen before. They're a mix of public-domain 
images and pictures we shot one sunny summer day near Seattle. So 
this is a real test of VGG16 applied to brand-new images. Of course, 
we pre-processed each of these images using the same transformation 
that the original developers of VGG16 applied to their training data. 


Figure 20.18 shows a typical result. We show the picture, and the top 
five scores reported by the network. Some of the network’s labels were 
long and contained multiple variations, so we show just the first few 
letters of each of those labels here. Not only did the network identify 
this image as a bear, but it correctly identified it as a brown bear. The 
scores show that it is nearly certain in that classification. The first four 
runner-up categories barely even register. 


Figure 20.18: VGG16 categorizing an image it’s never seen before. The 
top 5 categories are shown on the right, along with their scores. In this 
case, the system was just about certain that this is a brown bear (it is). 
Long labels were clipped to save room. For example, the full version of 
the second label is “American black bear, black bear, Ursus americanus, 
Euarctos americanus.” 


Let’s look at a few other examples. We'll start with some animals, in 
Figure 20.19. VGG16 was trained to recognize not just dogs but different 
breeds of dogs. 


Figure 20.19: VGG16's scores for animal images. 


Let’s try a few other pictures, shown in Figure 20.20. 


Figure 20.20: VGG16's scores for four more images. 


This is remarkable. Given pictures from the wild that it has never 
seen before, shot by different photographers on different equipment, 
with lots of confusing cues (such as the water behind the drake and 
the spotted towel under the cat), the system correctly identified every- 
thing. And it was all but certain for several images. 


But VGG16 has its weaknesses. It was trained on 1000 different cat- 
egories. That may sound like a lot, but there are about 550,000 to 
700,000 nouns in English [Tiago16]. That leaves a lot of categories 
that VGG16 is completely unaware of. 


It’s fun to give the network images that fall into categories it’s never 
seen. We can watch as it struggles to use its limited vocabulary to 
describe these images. Note that this is a completely unfair thing to do. 
We're asking the network to name objects it has never encountered 
before. It can’t possibly name them correctly, because it doesn’t even 
know the names to use. But just for fun let’s go ahead and see what it 
does. 


Figure 20.21 shows a set of four images from categories it’s never seen 
before. 


Figure 20.21: Four images from categories that VGG16 has never seen 
before. Clockwise from top-left they are a dinosaur model, tulips, a spring, 
and a trowel. 


Looking at the attempts is entertaining, and they often seem to make 
some sense. For instance, it seems to have thought that the trowel was 
a kind of insect. But notice the very low confidence scores for all these 
predictions. The network didn’t know what it was looking at, but it 
knew its guesses were probably pretty bad. 


Another set of 4 novel categories of images is shown in Figure 20.22. 


Figure 20.22: Four more images from categories the network has never 
seen before. Clockwise from top-left they are a toothbrush, a faucet, a 
pinecone, and a stick. 


It was pretty sure that the toothbrush was a ballpoint pen, which is not 
a terrible guess. It was somewhat sure that the outside faucet was a fishing 
reel. The stick sort of does look like a stone or crocodile if we squint. 
The really weird one is that the network was just about certain that the 
pinecone was a Gila monster! 


Just for fun, let’s give VGG16 some ridiculous images. We'll try a cou- 
ple of inkblots, and a couple of photos of textures. The results are 
shown in Figure 20.23. 


Figure 20.23: For the fun of it, here are VGG16's scores for four images 
it was not designed to handle. The top two are inkblots. The bottom 
two are a photo of a dirty field (bottom left) and a photo of a patch of 
lavender (bottom right). In all four cases, the system does its best, but it 
just doesn’t have the vocabulary to describe the images. 


A remarkable thing about VGG16 is that the network we saw in Figure 
20.17 is the whole thing. There are no details left out, no tricks or sur- 
prises. Simplicity of design and a deep stack of layers do a great job. 


Of course, training this beast is another matter. It takes careful plan- 
ning, and control of the initial learning rate and its decay schedule 
[Simonyan14]. It also took a lot of time. But that time only has to be 
invested once, and the weights can be used from then on. The architec- 
ture is like the blueprint of a palace; the weights are the crown jewels 
inside. 


20.7 Building A Deep Learner 


When we're designing a new architecture for deep learning, the first 
step is always data preparation. We want to learn something about our 
data, and process it if necessary. 


It’s common to simply look at the raw data to get a feeling for it. If the 
data is in text form, we might open up a spreadsheet or text editor. We 
want to get a feeling for the number of samples, the number of features, 
the ranges of the data, whether there are any obvious weird entries or 
typographical errors, and so on. 


Typically, we'll then use visualization tools to plot some or all of the 
data, so we can get a better feeling for it. Are there obvious patterns 
we can exploit? Are there redundancies we can eliminate? Are some 
features almost entirely empty, and thus not helpful? We can also run 
statistical tests to help identify patterns and trends that we can’t see by 
eye, particularly if the data has more than three dimensions. 


Depending on what we're planning to do with our data, we might 
transform it, or map named features to one-hot encodings (or dummy 
variables), as we saw in Chapter 12. 
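

As a reminder of what one-hot encoding looks like, here is a minimal Python sketch. The function name one_hot and the example labels are our own, and sorting the category names to fix their order is just one convenient choice.

```python
def one_hot(labels):
    """Map each named label to a list with a single 1 in that label's slot."""
    categories = sorted(set(labels))  # one slot per distinct label
    position = {name: i for i, name in enumerate(categories)}
    encoded = []
    for label in labels:
        row = [0] * len(categories)
        row[position[label]] = 1
        encoded.append(row)
    return categories, encoded

categories, encoded = one_hot(["cat", "dog", "bird", "dog"])
print(categories)  # ['bird', 'cat', 'dog']
print(encoded)     # [[0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 0, 1]]
```

Each sample's label becomes a list with exactly one 1, so no category is accidentally treated as "bigger" than another.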


We might apply one or more of the unsupervised machine-learning 
techniques we saw earlier to transform our data. For example, we 
could use the PCA method from Chapter 12 to analyze and simplify 
our data by removing or combining extraneous features. A typical final 
step of pre-processing for neural networks is to standardize some or 
all features, so they have an average of 0 and a standard deviation of 1. 


Now we can think about what we want the network to do, and identify 
the series of layers that will make up the architecture. These choices 
are guided by experience and intuition. It can be useful to perform 
quick experiments with small subsets of the data just to try out differ- 
ent ideas and see what works. 


Once we have the general structure in mind, we need to pick values 
for the parameters for each layer. Most layers have useful defaults for 
most parameters, but we will usually want to override at least some of 
those values to better match our data and the processing we want to 
apply to it. 


Then we have to select the hyperparameters, or the values that apply 
to the network as a whole. Generally, the most important hyperparam- 
eter to get right is the learning rate, as we discussed in Chapter 8. 


The next step is to actually run the system, teaching it with the train- 
ing data and then evaluating its performance on the test data. 


If it gives us acceptable results, we’re done! Otherwise, we have to roll 
up our sleeves and investigate. 


This is, in fact, the most common case. Much of building a great deep 
learning model comes from extensive testing, adjusting, following 
hunches, tinkering, trying more little tests, drawing and thinking about 
plots and graphs, and so on. It’s a lot of trial and error. We fiddle with 
the parameters to the layers, either tweaking them by hand or using a 
search algorithm to try out a bunch of variations. We fiddle with the 
hyperparameters like the learning rate the same way. We try one thing, 
then another. We might double the number of neurons on one layer, 
take away 20% of them on another layer, and maybe add or remove 
one or more dropout layers somewhere. 
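
One simple kind of search algorithm is a random search, sketched below. Everything here is hypothetical: train_and_score is a toy stand-in for really training a network and measuring it, and the ranges for the learning rate and neuron count are arbitrary choices for illustration.

```python
import random

def train_and_score(learning_rate, num_neurons):
    """Hypothetical stand-in. A real version would train a network and
    return its score on held-out data. This toy formula simply peaks
    near learning_rate=0.01 and num_neurons=64."""
    return -abs(learning_rate - 0.01) * 100 - abs(num_neurons - 64) / 64

def random_search(num_trials, seed=0):
    """Try random settings and remember the best-scoring ones."""
    rng = random.Random(seed)
    best_score, best_settings = float("-inf"), None
    for _ in range(num_trials):
        settings = {
            "learning_rate": 10 ** rng.uniform(-4, -1),   # sample on a log scale
            "num_neurons": rng.choice([16, 32, 64, 128, 256]),
        }
        score = train_and_score(**settings)
        if score > best_score:
            best_score, best_settings = score, settings
    return best_settings, best_score

best_settings, best_score = random_search(num_trials=50)
print(best_settings, best_score)
```

Tweaking by hand follows the same loop, with a person choosing each new setting instead of a random number generator.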


If there were a golden road to building the “best” deep learning architecture 
for a given problem, everyone would follow it. But what we 
have instead is a rich publication record of architectures that eventu- 
ally worked well for different applications. 


This is one of the great values of online machine-learning compe- 
titions such as Kaggle [Kaggle16], where large numbers of people 
compete to design a learning system that delivers the best results on a 
given dataset. Some competitions even come with cash rewards, such 
as the million-dollar Netflix Prize in 2009 [Netflix09]. Because most 
competitions publish the architectures of at least the winning entries 
(and sometimes all the entries), we can look over what different peo- 
ple tried, and see how well their systems performed. 


We can then re-implement their approaches, and even get right to work 
if the developers shared their final weights. We can then experiment 
with that model, adding and removing pieces and changing values, 
gaining experience and improving our own intuitions. 


20.7.1 Getting Started 


The first questions that most people confront when building a deep 
learning architecture are how many layers to use, and how many neurons 
to put on each layer. 


The goal is to find a nice balance of layers and neurons so that our network 
is powerful enough to learn what we need it to discover, but no 
more powerful than that (because excess capacity can lead to over-fitting, 
or even just wasted time during both training and prediction). 


We say that we want our architecture to be able to represent the 
model. We've used the word “model” above to refer to the architec- 
ture and its weights. But the word “model” also means the same sort of 
relationship in which a plastic car kit is a “model” for a real car. It’s a 
version of the thing we’re thinking of, but probably not the thing itself. 
In this case, we’re making a mathematical model of our input data. 
This model is made up of the architecture and its learned weights. 


Together, these elements of the model make up a representation of 
the thing we’re trying to learn. Generally speaking, that representation 
is nothing like the real thing. A collection of algorithms and numbers 
is not a real weather system, a real doctor, or a real baseball team. But 
if it predicts the behaviors of those things, given the right inputs, then 
it’s a version of the real thing that can be useful to us. 


We can mix and match the layers of a deep learning network in any 
way we like. But when we use one type of layer exclusively (or domi- 
nantly), we usually name the network for that type. 


Let’s recap some of the naming conventions we saw above. If our net- 
work is composed largely of fully-connected layers, then we call it a 
Multi-Layer Perceptron, or MLP. The name comes from thinking 
of the artificial neurons of the layers as perceptrons, and then explic- 
itly noting that we have several independent layers of them. 


If our network is largely about convolution (even if there are some 
other types in there), we call it a convolutional neural network, or 
convnet, or CNN. Most people would call the VGG16 model in Figure 
20.17 a CNN. We'll look at convnets more closely in Chapter 21. 


If our network is primarily about using recurrent modules to allow us 
to work with sequences, we call it a recurrent neural network, or 
RNN. We'll look at RNNs more closely in Chapter 22. 


20.8 Interpreting Results 


We've seen lots of different types of layers, and in Chapter 18 we saw 
how to use backprop to improve the prediction of a deep network. But 
what’s really going on? Can we explain why the network is producing 
the results it gives us? 


Let’s try to develop that intuition by considering the process of getting 
a loan. This is often an important event in someone’s life, and if we’re 
turned down for a loan we’re usually very interested to know why. This 
could help us change our situation so we can apply again later and get 
approved. 


Let’s start with how things worked long ago when loans were made 
person-to-person. Of course, we’re going to simplify everything here 
so we can focus on the steps that matter to this discussion. In 
practice, any loan application is going to be a complex affair. 


Before banks, getting a loan from a friend or colleague meant working 
up the courage to ask. The person we asked would apply their own cri- 
teria and decide whether or not they wanted to give us the money we 
asked for, and on what terms. If they said no, we might ask them why, 
and then discuss the pros and cons of their decision. Maybe one or 
both parties could change the other’s mind, or make mutually accept- 
able concessions. The key thing is that the potential lender could often 
tell us why they said no, since they know their own reasoning process. 


Then institutions like banks became the places to get a large loan for 
a car, home, or business. The bankers in a big city wouldn’t know all 
the people that came in and asked for a loan, so they’d ask applicants 
to fill out a standard application. The form would ask for a variety of 
data, such as how much money was being requested, for how long, the 
requestor’s annual income, and so on. The loan officer might look this 
all over, and based on their experience would decide whether or not to 
grant the loan. They might be reluctant to discuss their decision, but 
in principle they could explain why they decided the way they did. 


As banks got bigger, this process became more standardized. Someone 
in the bank probably sat down one day with hundreds or thousands of 
loan applications, and broke them into two piles: good loans that got 
paid back fully and on time, and bad loans where the bank lost some 
or all of its money. Based on the applications for those loans, they tried 
to come up with rules that would let them predict whether a loan was 
likely to end up getting paid back or not. 


Perhaps they did it by adding up a score. They might have noticed that 
asking for 10 times more than one has in the bank is usually a bad 
sign, so the score would become, say, -10. But if the applicant had a 
large annual income that could easily cover the cost of the loan, that 
might be worth +20, giving us a score of +10. But perhaps that annual 
income all comes from the stock market, which is uncertain, so maybe 
the officer would deduct 8 from the score. And so on, all the pieces of 
information contributing in their own way until they came together in 
a final score. If the final score was positive, the loan would be granted, 
otherwise it would be denied. 


This is of course just like what a perceptron does. The inputs are 
weighted and combined to produce a final score, as in Figure 20.24. 


Figure 20.24: To decide whether or not to issue a loan, a loan officer 
might take in the information on an application form, and weight the 
various pieces by different amounts to produce a final score. The value of 
that score determines whether or not the loan is granted. 
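
The scoring scheme can be written as a tiny weighted sum. In this sketch the three point values (-10, +20, -8) come from the example above. Treating each factor as an input of 1 (present) or 0 (absent) is an assumption made here for illustration, as is the name perceptron_score.

```python
def perceptron_score(inputs, weights):
    """Multiply each input by its weight and add up the results."""
    return sum(x * w for x, w in zip(inputs, weights))

# Factors, in order: asks for 10 times the bank balance,
# has a large annual income, income comes from the stock market.
weights = [-10, +20, -8]
inputs = [1, 1, 1]            # all three factors apply to this applicant

score = perceptron_score(inputs, weights)
approved = score > 0          # a positive final score grants the loan
print(score, approved)        # prints: 2 True
```

Feeding the same inputs through the same weights always yields the same score, which is why the outcome cannot change unless the inputs or the weights do.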


Now let’s imagine asking this loan officer for a loan. We hand over our 
application, and he applies the rules he came up with to give us a final 
score, which then tells him whether or not to grant the loan. 


Suppose we were denied the loan, and we asked him why. The officer 
could tell us all about his scoring system, and how, generally speaking, 
people who score high on some measures and low on other measures 
pay back their loans, and people who don’t match those criteria 
usually don’t. Our numbers landed us in the latter category. 


But no, we protest, you're overlooking a lot of important information. 
Maybe we have a great farm but the tractor unexpectedly broke last 
month, skewing the “monthly income” factor. Or maybe we need a new 
car to get to a great new job that’s already been offered to us, which 
will more than pay for the car. Our arguments basically boil down to 
pointing out factors that he didn’t know about or weight correctly, so 
we explain those and ask him to please reconsider his decision. 


Depending on the person and his position, he might be able to take 
into account our new information, or he might not. If the officer sticks 
to his process, like a perceptron, then there’s no reconsideration to be 
done. It would be pointless, since the same numbers would go in, and 
the same result would come out. 


We know “why” the loan was rejected, because he explained it all to us. 
But saying that the statistics weren’t in our favor is rarely an emotion- 
ally satisfactory explanation. We would probably walk out frustrated. 


We might try our luck at another bank. 


In this bank, let’s suppose that there are 5 different loan officers, and 
each one has developed his own idiosyncratic procedure for evaluating 
the criteria that go into a loan. We submit our application to the bank, 
and then all 5 officers evaluate our application in turn. Maybe 3 of 
them say yes, and 2 say no. If it’s a simple majority vote, then we’d get 
the loan. 


We can draw this as a two-layer neural network. The input is our appli- 
cation, followed by a layer with 5 neurons, one for each loan officer, 
each reading the input. Each officer’s decision is presented at their 
output. Let’s say +1 means that they would approve the loan, -1 means 
they would deny it, and intermediate scores represent more lukewarm 
conclusions. Figure 20.25 shows this interpretation. 


Figure 20.25: Our loan application is given to 5 different loan officers, 
each of whom carries out his own distinct evaluation. Their scores are 
weighted and then added together by the bank manager to produce a 
final score, which determines if the loan is granted or not. 


Each officer’s decision goes up to a second layer with a single neuron, 
representing the bank manager. Based on his experience with his sub- 
ordinates, the bank manager trusts some officers’ judgements more 
than others, so he multiplies each decision by some number before 
adding them together. As before, a positive final value means we get 
the loan, while a negative value means we don’t. 
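
The manager's step can be sketched as follows. This example starts from the officers' votes rather than from the application itself. The function name bank_decision and the second set of weights are hypothetical, chosen only to show how the manager's trust can change the outcome.

```python
import numpy as np

def bank_decision(officer_votes, manager_weights):
    """Weight each officer's vote and add them up; a positive total approves."""
    final_score = float(np.dot(officer_votes, manager_weights))
    return final_score, final_score > 0

votes = np.array([+1, +1, +1, -1, -1])      # 3 officers approve, 2 deny

# Equal trust in everyone is the same as a simple majority vote.
equal_trust = np.ones(5)
print(bank_decision(votes, equal_trust))    # prints: (1.0, True)

# Hypothetical weights: the manager trusts the last two officers much more.
uneven_trust = np.array([0.5, 0.5, 0.5, 2.0, 2.0])
print(bank_decision(votes, uneven_trust))   # prints: (-2.5, False)
```

The same five votes lead to opposite outcomes, depending only on the weights held by the manager.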


Because the bank manager wants to make sure that the loan officers 
are using as much information as possible, he tells each one to also 
consider how neatly the form itself was filled out, how the applicant 
is dressed, the weather at the time of the application, and a bunch of 
other ancillary information. All of this can be ignored or used by the 
loan officers when making their decisions. 


Let’s suppose that the bank is now issuing so many loans that the bank 
manager is getting overwhelmed. So he hires a bunch of supervisors 
to sit between the loan officers and himself. Each supervisor will do 
the job that the bank manager himself used to do, by evaluating the 
decisions of the individual loan officers and coming up with a final 
decision. But now those decisions from the supervisors are passed on 
to the bank manager, who makes the final decision. Let’s suppose we 
have 5 loan officers, 8 supervisors, and 1 bank manager. Then we can 
build a 3-layer neural network to represent this process, as in Figure 
20.26. 


Figure 20.26: Each of our loan officers reports his evaluation of our 
application to a supervisor. In this example we have 5 loan officers and 8 
supervisors. The supervisors report their decisions to the bank manager, 
who makes the final decision about the loan. 


The important thing to note is that the supervisors are not making 
their decisions based directly on the information we provide when we 
ask for a loan. That is, they’re not looking at the loan amount or our 
annual income. Instead, they’re looking at the decisions of the loan 
officers. Supervisor 2, for instance, might feel that loan officer 1 is too 
generous, and therefore gives loan officer 1’s opinion less weight than 
the others. But Supervisor 4 might feel that loan officer 1 is the best 
officer out there, and gives their opinion a lot of weight. In essence, 
each supervisor is looking for statistical patterns in the decisions of 
the loan officers, and is trying to use those patterns to come up with 
their own evaluations. 


So the supervisors combine the results of the loan officers, and then 
pass their judgements up to the bank manager. Now the bank manager 
is two steps removed from the information on our loan application. His 
final decision is based on how he chooses to weight the conclusions of 
the supervisors. 
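
Here is a rough sketch of that chain of 5 officers, 8 supervisors, and 1 manager. Several things are assumptions made only for illustration: the application has 4 numeric entries, the weights are random rather than learned, bias terms are left out, and tanh is used to keep every opinion between -1 and +1.

```python
import numpy as np

def forward(application, layer_weights):
    """Pass values up the chain. Each layer sees only the outputs of the
    layer just below it, never the original application."""
    values = application
    outputs = []
    for weights in layer_weights:
        values = np.tanh(weights @ values)   # weighted sums, squashed into -1..+1
        outputs.append(values)
    return outputs

rng = np.random.default_rng(0)
layer_sizes = [4, 5, 8, 1]     # form entries, officers, supervisors, manager
layer_weights = [rng.normal(size=(n_out, n_in))
                 for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

application = np.array([0.2, -0.5, 0.9, 0.1])   # made-up, already-scaled entries
officers, supervisors, manager = forward(application, layer_weights)
approved = manager[0] > 0
print(officers.shape, supervisors.shape, manager.shape, approved)
```

The supervisors' weights form a grid of 8 rows and 5 columns: one weight for each pairing of a supervisor with a loan officer, and none that touch the application directly.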


The bank can keep adding more and more layers of officers, each one 
reviewing the decisions of the previous layer and looking for statisti- 
cal patterns. They might find that with each new layer, their statistical 
analysis becomes more accurate. Let’s say that with two more layers 
of intermediate people combining their predecessors’ scores, they find 
that the final result correctly predicts whether a loan was paid back or 
not with 99.9% accuracy. 


This would be great for the bank, but terrible for customers who want 
to understand why their loans were denied. 


20.8.1 Satisfactory Explainability 


Suppose that the bank shared with us every step of its network’s pro- 
cess. We can see every calculation made by every neuron, up to and 
through the final result from the output neuron that ended up negative, 
denying us the loan. This would be completely transparent, provid- 
ing what has come to be called explainability, meaning just that it 
explains the result. If we can’t explain the result at all, we sometimes 
say it comes from a black box, referring to an inscrutable source that 
offers us no help. That’s not our situation here, because the bank has 
told us absolutely everything about how the decision was made. Is that 
explanation of any value to us? 


Suppose that we find, upon examination of the results, that statis- 
tically speaking, people who wear blue socks, and apply for a loan 
on a Wednesday afternoon between 2:00 and 2:25, and own a car 
made between 3 and 5 years ago, and fill out their form with a black 
ballpoint pen, are bad bets for home loans. We fulfill those conditions, 
so we've been rejected. It could be perfectly true, and it’s perfectly well 
explained, but it’s hardly satisfactory. 


These reasons are unsatisfactory because they don’t tell us why. These 
measurements can all feel arbitrary and irrelevant. They just tell us 
how things work out, statistically speaking. But we all feel that we’re 
individuals and we should be judged on our own merits, and if we’re 
being denied something we should be able to appeal that decision 
based on logic, reputation, honesty, and other qualities important to 
us in a social context. 


But those qualities are irrelevant to the network making the decisions, 
because it’s just focusing on the statistical measures it’s learned, and 
nothing else. 


So just getting an explanation of the decision is not enough. It has to 
be a satisfactory explanation. Different people might be satisfied 
by different types of explanations. One explanation that many peo- 
ple would probably find satisfactory would clearly communicate the 
factors that went into the decision, such that factors that we believe 
should be irrelevant (such as whether we're wearing blue socks or not) 
are either excluded or justified. We’d want this explanation to show us 
how we can change our situation to improve our chances of getting 
approved for a loan next time. 


Beyond being unsatisfying as an explanation, the statistics-based deci- 
sion process has another weakness. The decisions are historical. Let’s 
suppose that all the analyses by all these people were performed 3 
years ago, when the town was small and money was tight. Since then, 
the town has boomed, there’s much more money flowing around, there 
are many more people, and generally the economy is looking great. 


If the bank looked with fresh eyes at the loans that succeed and fail in 
this new economy, they might well come to very different rules for how 
they evaluate them. But they can’t really do that, because their system 
won't let them make many of those loans in the first place. 


The only way they'll come to find that their procedure is out of date is 
when many of the loans they do make don’t get paid back. But in the 
meantime, they’ve probably denied many loans to people who needed 
them and would have paid them back. They’ve hurt the town, and 
they’ve hurt themselves, because they never made any money on the 
loans they should have made. 


This is one of the dark sides of letting our networks make decisions for 
us. They don’t know when the underlying scenario changes, so they 
only find out that they’re messing up after they've messed up a lot. 


There are techniques for addressing this, and they’re getting better all 
the time [Samek17]. But explainability is still difficult. To make sure 
our systems are making decisions based on criteria that we judge to be 
appropriate, we need to stay vigilant [Domingos15]. 


Happily, there are many circumstances where environmental drift and 
lack of emotionally satisfying explanations are not problems. If our 
job is to identify the animal in a photo, then evolutionary changes may 
cause our system to start getting it wrong after a few millennia, but 
we could be okay in the short term. Many practical problems fall into 
this safe zone, from analyzing spoken requests to a digital assistant to 
working out how to drive a car on a rainy road. 


We need to be careful when our decisions start to impact people, because 
subtleties can make a huge difference to people’s lives. Subtleties in 
a learner’s training data can lead a system to a systematic bias in its 
results [Zomorodi17]. The real world is complex, and humans and 
human society are staggeringly complicated. Just because a learning 
system is able to process a set of images or statistics and produce a 
result of some kind does not automatically mean that predictions from 
that system will be accurate beyond the training data. People, popula- 
tions, and cultures can change quickly over time, and what is true on 
Monday may not be true on Friday. 


Obtaining data that is truly representative of some group of people, 
even a small group, is notoriously difficult, and may not even be possi- 
ble in practice for any non-trivial criteria [Scalas16]. Thus the resulting 
data will be inaccurate, probably in multiple ways. If we take too seriously 
the results of systems trained with such flawed data, we can be 
led to absurd conclusions [Wu16] [Wang17]. Even worse are systems 
where we cannot thoroughly examine and independently vet the qual- 
ity of their training data. The results from such “black box” systems 
can do real harm [Tashea17]. We need to think hard about what 
data we use to train our systems, how we carry out their training, and 
how we interpret their results [Arcas17]. 


Those are the kinds of problems to keep in mind as we discuss deep 
learning methods in the following chapters. 


called convolution to extract information from images and other block-like data structures. 


21.1 Why This Chapter Is Here 


Images make up a special class of input data. We use pictures and pho- 
tographs to communicate all kinds of things for professional, social, 
and personal reasons. Whether we want to label the face of a loved 
one so we can easily find them in photographs, determine whether a 
stick-figure drawing is a person or a cat, or judge whether a smudge on 
a radiograph is a medical condition requiring a closer look, extracting 
meaning from images is important. 


This chapter will focus on extracting sense from pictures using an 
idea called convolution. Convolution is easy to use in deep learning 
because it can be easily encapsulated in a convolution layer. 


Models that feature convolution layers have been spectacularly 
successful at working with images. For example, they excel at basic clas- 
sification tasks like determining if an image is a leopard or a cheetah, 
or a planet or a marble. We can recognize the people in a photograph 
[Sun14], detect and classify different types of skin cancers [Esteva17], 
repair image damage like dust, scratches, and blur [Mao16], and clas- 
sify people’s age and gender from their photos [Levi15]. 


These models are also useful in many other applications, such as natu- 
ral language processing [Britz15], where we can work out the structure 
of sentences [Kalchbrenner14], or classify sentences into different cat- 
egories [Kim14]. 


Building useful networks from convolution is the topic of intense 
research and development, with new and surprising results showing 
up frequently. 


21.2 Introduction 


Convolution is a well-established mathematical technique that was 
around long before computers, and has been applied to problems in 
many different fields. For example, in audio processing we can apply 
convolution to an existing recording to make it sound like it was 
recorded in a small night club, or a giant concert hall, or even outdoors 
[Hass13]. If we want to send music over the airwaves using AM or FM, 
then convolution shows us how to build our transmitters and receivers 
[Oppenheim96]. 


Although convolution can be used to process many different kinds of 
data, in this chapter we'll focus on image processing. To simplify 
our discussion, we'll talk exclusively about 2D images. In machine 
learning terms, each image is a sample. Each pixel in a grayscale 
image is a single feature. If the image is in color, then there are three 
features per pixel (one each for its red, green, and blue values). 


Even if the input to a convolution layer is an image, the output will 
be a 3D tensor, or block. Unlike a grayscale image that represents its 
pixel data with 1 channel, or a color image that is made from 3 chan- 
nels (one each for red, green, and blue), the tensor coming out of a 
convolution layer can have any number of channels. 


If we really want to visualize this tensor, we can peel away the top layer 
and draw it as a grayscale image, then do the same with the second 
layer, and so on. With this process in mind, some authors use the term 
“image” to refer to the tensor that comes out of a convolution layer, but 
keep in mind that this terminology is a stretch. 
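
To make the shapes concrete, here is a small sketch using NumPy arrays. The sizes are arbitrary, the 16-channel output is a made-up example, and storing the channels as the last dimension is just one common convention (some libraries put them first). For simplicity the output is given the same width and height as the input, which need not be the case.

```python
import numpy as np

height, width = 4, 6
gray_image = np.zeros((height, width, 1))      # 1 channel
color_image = np.zeros((height, width, 3))     # 3 channels: red, green, blue
layer_output = np.zeros((height, width, 16))   # hypothetical output with 16 channels

def peel_channels(tensor):
    """Split a (height, width, channels) tensor into a list of 2D slices,
    each of which could be drawn as its own grayscale image."""
    return [tensor[:, :, c] for c in range(tensor.shape[2])]

slices = peel_channels(layer_output)
print(len(slices), slices[0].shape)            # prints: 16 (4, 6)
```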


Since convolution layers are often arranged in series, with the output 
of one serving as the input to the next, the inputs and outputs of these 
layers can be tensors of any size. 


We'll see that it’s common to use multiple convolution layers in one 
network. A network where the convolution layers play a central role can 
be called a convolutional neural network, or convnet, or more 
commonly, a CNN. Sometimes people also say “CNN network” (an 
example of “redundant acronym syndrome syndrome” [Memmott15]). 


Before digging into convolution, we can save a world of confusion with 
a short detour into terminology. 


21.2.1 The Two Meanings of “Depth” 


There’s some unfortunate duplication of language that can get confus- 
ing if we don’t know it’s coming. The issue is the word depth, which 
carries two meanings. 


Every image has a size, given by its width and height. It will also 
have depth. Sometimes that refers to the number of bits used to store 
each pixel value, but more frequently it refers to the number of channels. 
Thus we say that a grayscale image has a depth of 1, while a color image 
(with one channel each for red, green, and blue) has a depth of 3. 


We’ve seen in previous chapters that the word “depth” often refers to 
the number of layers in a neural network. 


Hence, two meanings of “depth,” and the opportunity for confusion. 


When we use “depth” with reference to a color image, it refers to the 
number of color channels. Most color images are represented by three 
numbers at every pixel, describing the amount of red, green, and blue 
light carried by that pixel, as in Figure 21.1. Thus we say that a color 
image has a depth of 3. 


Figure 21.1: Each pixel in the RGB output is a composite of three values, 
one each for the red, green, and blue components. 


As we'll see below, when an image passes through a convolution layer, 
it comes out as a tensor. In this context, “depth” refers to one of the 
dimensions of the tensor, or multi-dimensional block of data, at any 
given location in the network. 


Sometimes people use the term “fiber size” for the thickness of a ten- 
sor instead of “depth” to prevent confusion, but that usage is still 
infrequent. 


In general, when talking about a tensor, “depth” refers to the size of 
one of its dimensions. When talking about a network, “depth” refers to 
the number of layers. 


21.2.2 Sum of Scaled Values 


To kick off our discussion of convolution, let’s consider just a single 
pixel in a color image. As we’ve discussed, each pixel contains 3 num- 
bers, one each for red, green, and blue. Suppose we want to determine 
if this pixel is yellow. 


A pixel on a screen is displayed with light, and in that scenario colors 
combine additively (unlike pigments, which combine subtractively). 
Using light, we combine red and green to get yellow. 


Let’s say that each of a pixel’s three primary colors is represented by a 
number from 0 to 1. So to test if a pixel is yellow, we want red (which 
we'll abbreviate as just R) and green (G) to be nearly 1, and blue (B) 
to be nearly 0. 


We'd like to make a single number that represents “yellowness.” The 
larger a pixel’s value of yellowness, the more yellow the pixel is. 


One way to measure yellowness is to find R+G-B for each pixel. Figure 
21.2 shows the value we get from this formula for eight different 
combinations of R, G, and B. The yellow pixel, with a score of 2, beats all 
the others, which have scores of -1, 0, and 1. 


name     black   blue   green   cyan   red   magenta   yellow   white
R          0       0      0      0      1       1         1       1
G          0       0      1      1      0       0         1       1
B          0       1      0      1      0       1         0       1
R+G-B      0      -1      1      0      1       0         2       1


Figure 21.2: We can detect yellow pixels by adding the red and green 
values together, then subtracting blue. Yellow pixels give us a value of 2 
from this, while other pixels have smaller values. 
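
The eight scores in Figure 21.2 are easy to reproduce. This sketch uses the color names and R, G, B values from the figure; the function name yellowness is chosen here just for illustration.

```python
# (R, G, B) values for the eight corner colors shown in Figure 21.2.
colors = {
    "black":   (0, 0, 0),
    "blue":    (0, 0, 1),
    "green":   (0, 1, 0),
    "cyan":    (0, 1, 1),
    "red":     (1, 0, 0),
    "magenta": (1, 0, 1),
    "yellow":  (1, 1, 0),
    "white":   (1, 1, 1),
}

def yellowness(r, g, b):
    """Add red and green, then subtract blue."""
    return r + g - b

scores = {name: yellowness(*rgb) for name, rgb in colors.items()}
most_yellow = max(scores, key=scores.get)
print(scores)
print(most_yellow)     # prints: yellow
```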


Another way to write R+G-B is to multiply red and green by +1, and 
blue by -1, and then add up the results: (1×R)+(1×G)+(-1×B). This may 
look familiar, because this little expression has the very same structure 
as the work carried out by an artificial neuron. We show that neuron in 
Figure 21.3. In this case, +1, +1, and -1 are the three weights, and the 
numbers associated with (R,G,B) are the three inputs. Each input is 
multiplied by its associated weight and the results are added together. 
Finishing the analogy, we need an activation function, so we'll choose 
the linear (identity) function, which passes its input through unchanged. 


Figure 21.3: Representing our yellow detector as a simple neuron. The 
pixel’s red and green values are inputs weighted with +1, and the pixel’s 
blue value is weighted with -1. The activation function is the identity 
function, shown here as a short diagonal line, which simply passes its 
input to its output without change. 


We've just created a little artificial neuron tuned to detecting the “yel- 
lowness” of a pixel. 


Figure 21.4 shows how the process we just described can be performed 
across an entire image. We treat each pixel like a “core sample” drilled 
into the 3 combined channels of the image. We extract the “core sam- 
ple,” and break it apart into three numbers, which become the inputs 
to the “yellowness” neuron. 


Figure 21.4: A way to draw the operation of our neuron in Figure 21.3 for 
application over an entire image. Each pixel is extracted as a “core sample” 
of three values, which are used as inputs to the neuron. This operation, 
with identical weights, is repeated for every pixel in the image. The result 
is a new one-channel image where every pixel’s value represents the 
yellowness of its corresponding input pixel. 


When we apply this neuron to all the pixels of an image, we often imag- 
ine the process as “scanning” the picture, moving the neuron from one 
pixel to the next, producing a new result pixel for each input pixel. This 
idea is shown in Figure 21.5. 


Figure 21.5: One way to think of applying Figure 21.4 at every pixel is to 
imagine that we “scan” the original image left-to-right, top-to-bottom, and 
save the result in a new image. 


If we run this neuron over each pixel of the image, one after the other, 
and save the output, we end up with a new image (with only one 
channel, since our neuron produces only one value) that tells us the 
“yellowness” of each pixel in the image, as in Figure 21.6. 





Figure 21.6: An application of our yellow-finding operation. The image 
on the right runs from black to white, depending on the yellowness of 
the corresponding source pixel in the left image. 
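
The scanning process can be sketched directly. The weights +1, +1, and -1 are the ones from our yellowness neuron. The tiny 2 by 2 image, the function name scan_image, and the channels-last layout are assumptions made for this example.

```python
import numpy as np

weights = np.array([1.0, 1.0, -1.0])     # the yellowness neuron: +1, +1, -1

def scan_image(image, weights):
    """Visit every pixel in turn, left-to-right and top-to-bottom, and
    apply the same weights to that pixel's three color values."""
    height, width, _ = image.shape
    result = np.zeros((height, width))    # the new one-channel image
    for row in range(height):
        for col in range(width):
            core_sample = image[row, col]             # three numbers: R, G, B
            result[row, col] = np.dot(core_sample, weights)
    return result

# A made-up 2 by 2 image: yellow and blue on top, white and black below.
image = np.array([[[1, 1, 0], [0, 0, 1]],
                  [[1, 1, 1], [0, 0, 0]]], dtype=float)
yellow_map = scan_image(image, weights)
print(yellow_map)
```

Because the linear activation function changes nothing, the whole scan is equivalent to the single NumPy expression image @ weights.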


Of course there’s nothing special about yellow. We could build a little 
neuron that would give its largest output for any shade or color of 
our choosing, including subtle ones that are precise combinations of 
the primary colors. 


21.2.3 Weight Sharing 


In the last section we used just one yellowness neuron, which we swept 
over the entire image. 


In earlier discussions we imagined that each neuron’s weights were 
associated with its input wires, because that made them easier to name 
and discuss. But as we saw in Chapter 10, the weights are actually 
“inside” the neuron, or part of the neuron’s structure. For now, let’s 
return to thinking of them as belonging inside the neuron. So as we 
move the neuron over the image, it’s carrying its own weights with it. 


The upshot is that every pixel gets evaluated in exactly the same way, 
by the same neuron with the same weights. For each new pixel, only 
the input values, and thus the output value, change. 


Pretend for a moment that we wanted to do this yellow-finding pro- 
cess for the entire image as quickly as possible. Let’s suppose that the 
image is in a specific piece of memory, with the data for each pixel 
made available for us to connect to. 


We might build a hardware version of our yellowness neuron, and then 
attach an identical copy of that hardware (including the weights) to 
every pixel in the image. We could then evaluate all of these neurons 
simultaneously, producing an entire image’s worth of “yellowness” 
categorization in the time it takes to run one neuron. 


Now suppose we wanted to detect a different color, say “magenta.” 
Because we’ve built the yellowness weights into the hardware neurons, 
we can’t re-use those circuits. We’d have to disconnect them, build new 
magenta-finding neurons, and wire them up to our pixels.


To save some time and effort, let’s put the weight information for
each neuron into a piece of memory that we can both read and write.
We'll then place just one set of weights in some other memory
somewhere. Let’s call these the shared weights. When we want to detect
a color in our image, all the neurons could read the shared weights
from that location and save them internally. Then they’d all use those
same weights when evaluating pixels. This would let us apply all the
neurons simultaneously, as before, but we can change the weights for
all the neurons any time we like. So if we want to search for any other
color, we just change the shared weights that get read, and we don’t
have to re-wire anything.


This is an entirely practical idea if we have some parallel hardware 
around (like a GPU). Using that, we can run multiple software copies 
of a given neuron in parallel, working on many pixels at the same time, 
as in Figure 21.7. Because we use the same weights at every pixel, the 
result of running these operations in parallel is identical to that result- 
ing from scanning a single neuron over the image. It just comes out 
faster.





Figure 21.7: Using a parallel computer like a GPU, we can apply the same 
neuron to many pixels in the image simultaneously and independently. 
Here each circle with a multiply sign represents the operation of Figure
21.4, where each of the pixel’s three values is multiplied by the
corresponding weight, and the results are added together.


When we use one set of weights for many copies of the same neuron, 
we call this weight sharing. 
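
To make the equivalence between scanning and weight sharing concrete, here is a small sketch. The random image and both sets of weights are assumptions made up for the demonstration; NumPy's vectorized arithmetic stands in for the parallel hardware.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((4, 5, 3))                 # made-up RGB image
shared_weights = np.array([1.0, 1.0, -1.0])   # assumed "yellowness" weights

# Scanning: visit one pixel at a time with a single neuron.
scanned = np.zeros(image.shape[:2])
for row in range(image.shape[0]):
    for col in range(image.shape[1]):
        scanned[row, col] = np.dot(image[row, col], shared_weights)

# Weight sharing: every pixel's copy of the neuron reads the same weights.
# NumPy evaluates all the copies in one vectorized step.
parallel = image @ shared_weights

# Looking for another color only means swapping the shared weights.
magenta_weights = np.array([1.0, -1.0, 1.0])  # also an assumption
magenta = image @ magenta_weights

print(np.allclose(scanned, parallel))
```

The two results agree because every pixel is evaluated with the same weights either way. Only the order of evaluation differs.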


21.2.4 Local Receptive Field 


So far, our one neuron that’s scanned over the image (or applied in 
parallel using weight sharing) is working with just one pixel at a time. 
We move the neuron to the pixel we want to process, read that pixel’s 
value and run it through the neuron to compute an output, then move 
the neuron to the next pixel and repeat the process. 


We'll later see that we can connect our neurons to read several pixels
at a time. Although this group of input pixels can take any shape, it’s
almost always a square, with our neuron at the center of the square.


For example, suppose that the neuron gets its input from a 3 by 3
square of pixels (we'll see how this is done in a moment). Then the 
pixel’s location is the center of the square, and the other 8 pixels are 
in a ring around it, as in Figure 21.8. In this example, our input image 
is grayscale, so there’s one value for each pixel. Thus our neuron takes 
in these 9 input values, multiplies each one by a corresponding weight, 
and sums the results. That single value goes into the output image in 
the same location as the highlighted pixel at the center of the square. 





Figure 21.8: The neuron gets inputs in this case from a 3 by 3 square 
around the pixel. The location of the pixel in the image is highlighted in 
bright red. The set of all 9 pixels is called the local receptive field for the 
neuron. Our input image here has one channel, so the neuron receives 9 
inputs. Each is multiplied by a corresponding weight, shown in blue. 


We say that the neuron has a local receptive field, meaning that the 
region (or “field”) from which it reads (or “receives”) values is small 
(“local”). We sometimes refer to the local receptive field more simply 
as the neuron’s footprint. The local receptive field is usually a little
square that’s 1, 3, 5, or 7 pixels on a side, as in Figure 21.8, and
centered under the neuron. Larger squares, and even other shapes, can be
used as well. 


We know that the result of a neuron’s evaluation of the pixels in its 
local receptive field produces a single value, which goes into a new 
pixel in the output image. But specifically where in the output does 
that pixel go? 


To answer this, we designate one of the elements in the neuron’s grid
of weights (which we’ll soon call the kernel) as the anchor, or
reference point, or zero point. This is highlighted in black in the
center of the 3 by 3 grid in Figure 21.9. As we move the kernel over
the image, the anchor moves with it. When we have a square kernel, we
usually place the anchor in the center. We can say that the pixel that
is under the anchor is the pixel that’s being evaluated. We can call
this the focus pixel.







Figure 21.9: The neuron has a 3 by 3 receptive field, with the highlighted 
anchor at the center. The bright red pixel in the input image, called the 
focus pixel, shows where the anchor is currently located. The output of 
the neuron goes into the output image at the same location as the focus 
pixel.


As we think of moving a single neuron over the face of the input image,
we can phrase this as moving the anchor of the receptive field from 
one focus pixel to the next. At each such pixel, the neuron evaluates 
the input values, produces an output value, and saves that in the out- 
put image at the same location as the focus pixel. Then it moves on to 
the next focus pixel. 
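
The evaluation at a single focus pixel might look like the sketch below. The function name, the 5 by 5 image, and the averaging weights are all hypothetical. The sketch assumes a square kernel with an odd side length, the anchor at its center, and a footprint that lies entirely inside the image.

```python
import numpy as np

def evaluate_focus_pixel(image, kernel, row, col):
    """Weighted sum of the footprint whose anchor sits on (row, col).

    Assumes a square kernel with an odd side length and the anchor at its
    center, and that the footprint lies entirely inside the image.
    """
    half = kernel.shape[0] // 2
    footprint = image[row - half:row + half + 1, col - half:col + half + 1]
    return np.sum(footprint * kernel)

image = np.arange(25, dtype=float).reshape(5, 5)   # made-up grayscale image
kernel = np.full((3, 3), 1.0 / 9.0)                # made-up averaging weights

# The anchor is placed over the pixel at row 2, column 2.
value = evaluate_focus_pixel(image, kernel, row=2, col=2)
print(value)
```

Because the kernel has an odd side length, `half` pixels lie on each side of the anchor, so the footprint is symmetrical around the focus pixel.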


This explains the popularity of footprints that are squares with an odd 
number of pixels on each side (often between 1 and 7). Those squares 
each have a pixel right in the center, which keeps everything simple 
and symmetrical. 


21.2.5 The Kernel 


When we think about a neuron as a single thing that’s moved across 
the image (or applied in parallel with shared weights), we often call its 
weights a kernel or filter. 


The word “kernel” comes from mathematics, where it has been used 
for a long time to refer to these values that form the conceptual core, or 
“kernel,” of operations like the one performed by our artificial neurons. 
The word “filter” comes from thinking about the neuron as manipulat- 
ing, or “filtering,” input data. 


We sometimes extend the word “filter” to include the neuron itself, so 
we might say, “We move the filter over the image.” We also use it as a 
verb, so we might say, “The next step is to filter the image,” meaning 
that we'll apply a specific set of weights to the image’s pixels.


21.3 Convolution


The name “convolutional neural networks” makes it pretty clear that
“convolution” plays a big part in their operation. As we mentioned
before, convolution is the name for a particular type of mathematical
operation that involves two inputs. Although we won’t go into the
mathematical formalities of convolution, we can summarize it as
basically a carefully choreographed combination of multiplication and
addition.


The good news is that we’re already familiar with convolution, because 
it’s what our artificial neurons have been doing all along, even though 
we haven't described them in this particular way. 


Convolution starts with two lists of numbers of equal length. Using our 
existing language, let’s call one list the input values, and the other list 
the weights. We then multiply the first input and the first weight, the 
second input and the second weight, and so on. When all the multipli- 
cations are done, the results are added together, and that’s the result 
of the operation. 


And that’s convolution. We'll note in passing that we’ve left off
a few details that belong to the formal definition of convolution
[Oppenheim96]. That’s okay because those details don’t affect the big
picture, so unless we’re programming the low-level algorithms directly,
we can safely ignore them.


Figure 21.10 shows this in action. Though we’ve moved the pieces 
around graphically, this is doing the same thing as our pictures of neu- 
rons that weight their inputs and sum the results. 
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Weights 


Figure 21.10: A step of convolution involves two lists, which we can 
call “inputs” and “weights,” though both lists are treated the same way. 
Corresponding values are multiplied together pairwise, and then the 
results are added together. This is what an artificial neuron does, ignoring 
the activation function. 
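
The multiply-and-sum step of Figure 21.10 can be written out directly. The function name and the two lists of numbers below are made up for illustration. The same operation is widely known as a dot product, which NumPy provides as `np.dot`.

```python
import numpy as np

def multiply_and_sum(inputs, weights):
    """Pair up two equal-length lists, multiply each pair, add the products."""
    if len(inputs) != len(weights):
        raise ValueError("the two lists must have the same length")
    total = 0.0
    for value, weight in zip(inputs, weights):
        total += value * weight
    return total

inputs = [2.0, 0.0, 1.0, 3.0]      # made-up values
weights = [1.0, 5.0, -1.0, 2.0]    # made-up values

result = multiply_and_sum(inputs, weights)   # 2 + 0 - 1 + 6
same = np.dot(inputs, weights)               # NumPy's dot product
print(result, same)
```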


With just a little conceptual change, this idea suddenly becomes a pow- 
erful tool for working with images. 


The change is that instead of thinking of our inputs and weights as just 
simple lists, we think of them as grids of numbers (in fact, they can be 
tensors of any shape, but we'll stick with grids for now). The two grids 
must be of the same shape. The operation is just the same as before: 
each element of the input grid is multiplied by the corresponding ele- 
ment of the weight grid, and the resulting values are then all added 
together. Figure 21.11 shows the idea.


Figure 21.11: Convolution in 2D works just like in 1D, though the picture
gets more cluttered. Each pair of corresponding values gets multiplied
together, and then all of those products are added together.


We say that our change from a list of values to a grid is only
conceptual because we can always disassemble a grid into a list, and
then use the list-based version of Figure 21.10. For instance, just make
a new list that contains the first row of the input grid, followed by the
second row, then the third row, and so on, as in Figure 21.12. We can
then do the same thing for the weights, and then multiply corresponding
list entries just as in Figure 21.10.


Figure 21.12: We can think of 2D convolution as just 1D convolution if
we first reshape each 2D grid into a single list. We just take the top row, 
append the second row, then append the third row, and so on. If we do 
this for both grids, we can just multiply the elements together pairwise 
and add up the result. 
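
A short check of this claim, using two made-up 3 by 3 grids: unrolling each grid row by row and using the list version gives the same number as working on the grids directly.

```python
import numpy as np

inputs = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0],
                   [7.0, 8.0, 9.0]])       # made-up input grid
weights = np.array([[-1.0, 1.0, -1.0],
                    [-1.0, 1.0, -1.0],
                    [-1.0, 1.0, -1.0]])    # made-up weight grid

# Grid version: multiply matching cells, then add everything up.
grid_result = np.sum(inputs * weights)

# List version: unroll each grid row by row, then use the 1D operation.
input_list = inputs.reshape(-1)     # row 0, then row 1, then row 2
weight_list = weights.reshape(-1)
list_result = np.dot(input_list, weight_list)

print(grid_result, list_result)
```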


Let’s say we have a neuron with a filter composed of 9 weights, arranged
in a 3 by 3 grid. As we discussed earlier, we can move this filter over
the image. At each location we'll have 9 pixels providing us with input
values. We can multiply the two grids (or lists) together element by
element, add up the results, and thus have a value for that pixel in a
new image.


We call this convolving the filter with the image, meaning that we
move the filter so that its anchor goes from one pixel to the next,
and at each point we gather up the pixel values in its footprint,
multiply each one by the corresponding weight in the kernel, and add
up the results to produce an output.


Figure 21.13 shows the idea. In this figure, we have a 3 by 3 filter
sweeping over a 7 by 7 image. We haven't yet discussed what might happen
if the filter falls “off the edge,” so for now we'll just limit ourselves
to those locations where the filter sits entirely on top of the image.
Since the 3 by 3 filter fits in only 5 positions along each side of the
7 by 7 image, the output image is only 5 by 5.


Figure 21.13: To convolve an image with a filter, we move the filter across
the image and apply it at each position. The resulting value then becomes 
the value for that pixel in the result. Here are some positions of the filter 
in the input, and the positions where their computed values go into the 
output. Note that because the filter can’t extend past the edges, the input 
is 7 by 7 but the output is only 5 by 5. 
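
Here is a minimal sketch of this sweep, restricted to the positions where the filter fits entirely inside the image. The function name and the test values are hypothetical. Following the description in the text, the filter is multiplied against each footprint as-is, without the extra details of the formal definition.

```python
import numpy as np

def convolve_valid(image, kernel):
    """Slide the kernel over every position where it fits entirely.

    As in the text, this multiplies and sums without flipping the kernel.
    """
    k_rows, k_cols = kernel.shape
    out_rows = image.shape[0] - k_rows + 1
    out_cols = image.shape[1] - k_cols + 1
    output = np.zeros((out_rows, out_cols))
    for row in range(out_rows):
        for col in range(out_cols):
            footprint = image[row:row + k_rows, col:col + k_cols]
            output[row, col] = np.sum(footprint * kernel)
    return output

image = np.arange(49, dtype=float).reshape(7, 7)   # made-up 7 by 7 image
kernel = np.ones((3, 3))                           # made-up 3 by 3 filter
result = convolve_valid(image, kernel)
print(result.shape)
```

The output size follows from counting positions: along each side there are 7 - 3 + 1 = 5 places where the filter fits.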


Why would we want to do something like this? Let’s look at filters more 
closely. 


21.3.1 Filters 


Some scientists who study toads think that certain cells in the
animal’s visual system are sensitive to specific types of visual patterns
[Ewert85]. The idea is that the toad is looking for specific shapes that
look like the creatures it likes to eat, and for certain motions that
those animals make.


People used to think that a toad’s eyes absorbed all the light that struck
them, sent that mass of information to the brain, and then relied on the
brain to sift among the results looking for food. The new hypothesis is
that the cells in the eye are doing the initial detection all by themselves,
and they only fire and pass on information to the brain if they “think”
they’re looking at prey.


The idea has been extended to the human visual system, where it has been
hypothesized that individual neurons fire in response to pictures of
specific people. The original study that led to this suggestion included
87 different images, including people, animals, and landmarks. In at
least one volunteer they found a specific neuron that only fired when
the visual system was presented with a photo of the actress Jennifer
Aniston, leading to the idea of the so-called “Jennifer Aniston neuron”
[Quiroga05]. Curiously, that neuron only fired when Aniston was alone,
and not when she was pictured together with other famous actors.


These ideas are not universally accepted [Sciffman01], but we’re not
doing real neuroscience and biology here. We’re just looking for inspi- 
ration. And this seems like some pretty great inspiration. 


The connection to convolutional layers is that we can use filters to
simulate the toad’s eyes. The filters are the tools that pick out the
patterns we're looking for, and then pass on their discoveries to later
layers, which can then process that information.


Some of the terminology for this process reuses terms that we've seen
before, but with new meanings. Specifically, we’ve been using the word
“feature” to refer to one of the values contained in a sample, such as
the temperature in a sample that contains multiple measurements of
weather. But in this context, the word feature refers to the particular
structure of an image that a filter is looking for. So we might say that
a filter is looking for a stripe feature, or a feature that looks like an
eyeball. Continuing this usage, the filters themselves are sometimes
called feature detectors.


Let’s see how feature detection works with a simple example. In Figure
21.14 we show the process of applying a filter looking for short,
isolated vertical white stripes to an image. Because the various pieces of
this figure use different ranges of numbers, we’ve used different colors
to show their values.


Figure 21.14(a) shows a 3 by 3 filter. The red cells show where the filter
has a value of -1, and the yellow cells show where the filter has a value
of 1. Figure 21.14(b) shows a noisy input image, ranging from 0 (black)
to 1 (white). Figure 21.14(c) shows the result of applying the filter to
each pixel in the input image (except for the outermost border). Here
the values range from -6 (in purple) to +3 (in cyan). As we'll see below,
a score of +3 means that the filter and the image were a perfect match.





Figure 21.14: 2D pattern matching with convolution. (a): A 3 by 3 pattern 
of a column of 1’s (yellow), surrounded by -1’s (red). (b) A noisy input image 
of 0’s (black) and 1’s (white). (c) The output of the filter placed over every 
pixel in the image. The outputs run from -6 in purple to +3 in cyan. (d) 
A thresholded version of part (c), where pixels with a perfect score of +3 
are in white, and others are in black. (e) The input image of part (b), but 
the 3 by 3 block around each white pixel in part (d) is highlighted. 


Figure 21.14(d) shows a thresholded version of Figure 21.14(c), where 
pixels with a value of +3 are shown in white, and all others are black. 
Finally, Figure 21.14(e) shows the noisy image of Figure 21.14(b) with 
the 3 by 3 grid of pixels around the white pixels in Figure 21.14(d) 
highlighted. We can see that the filter found those places in the image 
where the pixels matched the filter’s pattern. 
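
The steps of Figure 21.14 can be imitated with a short sketch. The image below is not the one in the figure: it is random noise generated for this example, with one isolated stripe planted by hand so that at least one perfect match exists. The function name is hypothetical.

```python
import numpy as np

def filter_interior(image, kernel):
    """Apply a 3 by 3 kernel at every pixel except the outermost border."""
    rows, cols = image.shape
    scores = np.zeros((rows - 2, cols - 2))
    for row in range(rows - 2):
        for col in range(cols - 2):
            scores[row, col] = np.sum(image[row:row + 3, col:col + 3] * kernel)
    return scores

# Vertical-stripe filter: a column of 1's flanked by -1's, as in part (a).
stripe = np.array([[-1.0, 1.0, -1.0],
                   [-1.0, 1.0, -1.0],
                   [-1.0, 1.0, -1.0]])

# A made-up noisy binary image standing in for part (b).
rng = np.random.default_rng(1)
image = (rng.random((12, 12)) < 0.3).astype(float)
# Plant one isolated stripe so there is at least one perfect match.
image[4:7, 4:7] = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]

scores = filter_interior(image, stripe)       # like part (c)
perfect = scores == 3                         # like part (d): keep only +3
# Offsets of +1 convert score coordinates back to focus-pixel coordinates.
matches = [(int(r) + 1, int(c) + 1) for r, c in np.argwhere(perfect)]
print(matches)
```

The list of matches includes the focus pixel at row 5, column 5, which is the center of the planted stripe. Every score lies between -6 and +3, since the filter holds six values of -1 and three values of 1.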


Let’s see why this worked. In the top row of Figure 21.15 we’ve shown
our filter and a 3 by 3 patch of the image, along with the pixel by pixel
results. In this situation, only one of the white pixels (the top center) is
matched by a 1 in the filter. This gives a result of 1×1=1. The others are
matched up with -1, giving results of -1×1=-1. The zeros in the image
are irrelevant, since whether we multiply them by 1 or -1 we still get
back 0. Adding up the -1 score for each of the three white pixels in the
corners with the 1 score for the white pixel in the top center gives us
-3+1=-2.


Figure 21.15: Applying a filter to an image. The two rows show two
different pieces of the image. Left: The filter. Middle: The section of the 
image, containing either white pixels (1) or black pixels (0). Right: The 
result of multiplying each image pixel by its corresponding filter value. 
Adding up the nine values gives us the final sum on the right. 


In the lower row, our image matches the filter. All three white pixels
contribute 1, and there are no out-of-place white pixels that would pull
the score down by contributing -1. There are also no missing white
pixels, which would also bring our score down by not adding 1. The
result is a score of 3, indicating a perfect match.


This process is a close match to our yellow-finding neuron of Figure 
21.3, except here we’re using only one weight per pixel, and our weights 
are spread out over multiple pixels. 
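
The arithmetic for both rows of Figure 21.15 can be reproduced as follows. The text tells us that the upper patch has white pixels at the top center and in three corners, but not which three, so the particular corners chosen here are an assumption. The choice doesn't affect the score.

```python
import numpy as np

stripe = np.array([[-1, 1, -1],
                   [-1, 1, -1],
                   [-1, 1, -1]])

# Upper row: white at the top center and in three corners. Which three
# corners are white is an assumption; it does not change the score.
mismatch = np.array([[1, 1, 1],
                     [0, 0, 0],
                     [1, 0, 0]])

# Lower row: a white column on black, exactly the filter's pattern.
match = np.array([[0, 1, 0],
                  [0, 1, 0],
                  [0, 1, 0]])

for name, patch in [("upper row", mismatch), ("lower row", match)]:
    products = stripe * patch          # pixel-by-pixel products
    print(name, int(products.sum()))
```

The upper patch scores -2 and the lower patch scores 3, in agreement with the sums worked out above.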


Figure 21.16 shows another filter, this one looking for diagonals. We'll 
run it over the same image. This diagonal of 3 white pixels surrounded 
by black is present in exactly one place in our random image, near the 
lower-left corner.


Figure 21.16: Another filter and its result on our random image. The three
diagonal white dots surrounded by black are found in only one place.


So by moving the filter over the image and measuring the final value at 
each pixel, we can hunt for the patterns we’re looking for. With larger 
filters that contain more nuanced values than simply +1 and -1, we
can make much more complex patterns to find more interesting fea- 
tures. We can even perform image processing operations like blurring 
and sharpening the image [Snavely13]. 


If we take the outputs of a first set of filters and feed them to another set,
we can look for patterns of patterns. If we feed that second set of out- 
puts to a third set of filters, now we’re looking for patterns of patterns 
of patterns. We can repeat this many times, creating a deep hierarchy 
of filters. Perhaps surprisingly, such a hierarchy allows us to look for 
complex features of any orientation or size, whether it’s the face of a 
friend, the grain of a basketball, or the eye on the end of a peacock’s 
feather. 


If we had to work out these filters by hand, classifying images would 
be tedious at best. What are the proper filters in a hierarchy 8 levels 
deep that will tell us if a picture is that of a kitten or an airplane? How 
would we even go about working out that problem? And how would 
we know when we had the best filters? 


The beauty of CNNs is that we don’t have to figure out which filters we 
need, because the computer does it for us. 


The learning process that we’ve seen in previous chapters, involving
measuring error and then improving weights with backprop, teaches a
CNN to find the best filters. It modifies the weights (that is, the values
in the kernel) of each filter, until the network is producing results that
match our targets. In other words, it tunes the values in the filters until
they find the features that enable it to come up with the right answer.
And it can do this for hundreds or even thousands of filters, all at once.


It can seem almost magical that this can work, but the training is basi- 
cally just standard backprop and gradient descent. At the broadest 
level, we just modify each weight in each kernel in such a way that it 
follows the downhill error gradient and thus brings down the overall 
error. Do this enough times, and the weights will form filters that give 
us all the information we need to transform an input into an output 
that matches the label. 
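
To give a feel for "following the downhill error gradient," here is a deliberately tiny sketch: a single 3 by 3 filter, no other layers, and plain gradient descent on a mean squared error. Everything in it is an assumption made for the demonstration, including the synthetic zero-mean "image," the target filter, the learning rate, and the number of steps. A real CNN trains many filters across many layers using backprop, but the weight update has the same flavor.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
image = rng.standard_normal((20, 20))      # synthetic zero-mean "image"
true_kernel = np.array([[-1.0, 1.0, -1.0],
                        [-1.0, 1.0, -1.0],
                        [-1.0, 1.0, -1.0]])

# Every 3 by 3 footprint in the image, shape (18, 18, 3, 3).
windows = sliding_window_view(image, (3, 3))
# Targets: what the "right" filter would output at each position.
targets = np.einsum("ijuv,uv->ij", windows, true_kernel)

kernel = np.zeros((3, 3))                  # start knowing nothing
learning_rate = 0.05                       # assumed value
for step in range(500):
    outputs = np.einsum("ijuv,uv->ij", windows, kernel)
    error = outputs - targets
    # Gradient of the mean squared error with respect to each kernel weight.
    gradient = 2.0 * np.einsum("ij,ijuv->uv", error, windows) / error.size
    kernel -= learning_rate * gradient     # step downhill

print(np.round(kernel, 3))
```

Starting from all zeros, the kernel drifts toward the stripe filter that generated the targets, even though we never told the program what that filter was.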


21.3.2 A Fly’s-Eye View 


Let’s look at an alternative way to visualize this whole process. Instead 
of sliding a filter over an image, imagine taking a photograph and 
breaking it into many small, overlapping pieces, each the size of the fil- 
ter. Then we plunk the filter over each piece and perform the filtering 
operation as before, multiplying each pixel value by its corresponding 
filter value [Geitgey16]. Figure 21.17 shows the little pieces of the input 
image that we’d then place the filter on top of.


Figure 21.17: Rather than thinking of sweeping the filter across the
image, we can imagine breaking the image up into overlapping pieces,
like a fly’s-eye view of the image. Then we apply the filter to each piece,
producing the output values (image after [Geitgey16]).


In this view of convolution, we place the filter over each little
overlapping piece of the input image, rather than sweeping it over the
original image.


Both ways of thinking about this yield the same result. 
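
A quick numerical check of this equivalence, using a made-up image and filter. NumPy's `sliding_window_view` is used here to cut the image into its overlapping pieces.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
image = rng.random((6, 6))        # made-up grayscale image
kernel = rng.random((3, 3))       # made-up filter

# View 1: sweep the filter across the image, one position at a time.
swept = np.zeros((4, 4))
for row in range(4):
    for col in range(4):
        swept[row, col] = np.sum(image[row:row + 3, col:col + 3] * kernel)

# View 2: first cut the image into overlapping 3 by 3 pieces ...
pieces = sliding_window_view(image, (3, 3))     # shape (4, 4, 3, 3)
# ... then apply the same filter to every piece.
flys_eye = np.sum(pieces * kernel, axis=(2, 3))

print(np.allclose(swept, flys_eye))
```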


21.3.3 Hierarchies of Filters 


We’ve taken inspiration from biology already in this chapter, and we 
can do it again. 


Many real visual systems seem to be arranged hierarchically 
[Serre14]. In broad terms, we think of the processing in the visual sys- 
tem as taking place in a series of layers, with each successive layer 
working at a higher level of abstraction than the one before. Returning 
to the visual system of a toad, the bottom-most layer might be looking 
for “light-colored blobs,” the next “combinations of things from the 
previous layer that also have wings,” the next “combinations of things 
from the previous layer that are moving in short fast bursts,” and so 
on up to the top layer which looks for “flies” (these features are com- 
pletely imaginary and used just as illustrations of the idea). 


This approach is nice conceptually, because it lets us structure our 
analysis of an image in terms of a hierarchy of image elements, and the 
filters that look for them. It’s also nice for implementations, because 
it’s a flexible and efficient way to analyze an image. 


Let’s look at a simplified example to get a feeling for the process. We'll 
try to find a face in a 27 by 27 binary image. Let’s suppose we want to 
find the face on the left of Figure 21.18, but our input image contains 
all kinds of other stuff, shown on the right of Figure 21.18. Since we 
know the exact locations of all the pixels we’re interested in, we could 
just check for their presence directly. But let’s see how to solve this 
problem using a hierarchy of convolution filters, because that more 
general approach is what we'll use later to find objects even in complex 
color photographs.


Figure 21.18: Finding a face in a 27 by 27 binary image. Left: The face we'd
like to detect. Right: The input image we're given. 


The overall flow of our strategy is shown in Figure 21.19. We'll begin 
on Layer 1 by applying five little filters, each one just 3 by 3. Our first 
filter will look for 3 black elements in a horizontal row surrounded by 
white elements above and below. We'll call that filter H, for horizontal. 
We'll make a similar filter for a vertical stripe and call the filter V. We'll 
also look for solid 3 by 3 blocks, and call that filter B. To include a bit 
of detail, we'll also look for left and right ends of horizontal lines as a 
single black pixel surrounded by white pixels, and we'll call those
filters L and R. These five filters make up our first layer.


Figure 21.19: Using 3 layers of filters, each filter only 3 by 3 pixels, to find
a forward or profile face in a 27 by 27 image. 


After these filters have run, we’ll use max pooling to reduce their
output dimensions by a factor of 3 on each side. That is, we'll consider
the output of each filter as a set of non-overlapping 3 by 3 blocks. Any
time we find a block that contains a cell that matched that filter, we’ll
mark in black the corresponding cell in the lower-resolution output. So
each of the five filters produces an output that’s 27 by 27, which we
then reduce to 9 by 9.


We've reached Layer 2, where we'll apply three more filters, each also
of size 3 by 3. These higher-level filters will examine the 9 by 9 images
coming out of Layer 1. They’re looking for combinations of our 5
lowest-level building blocks that make up pieces of our face. We'll mark
each entry in these filters with the building block we want, or an X if 
we don’t care what’s in that cell. We'll make a filter for the nose, which 
is just a little vertical line over a horizontal line (we'll call that filter N), 
a filter for an eye, which is a block of pixels under an eyebrow made up 
of a horizontal line with left and right ends (we'll call that filter E), and 
a filter for the mouth, a long horizontal line with a shorter horizontal 
line under it (we'll call that M). Once again we'll apply pooling, so each 
filter’s final 9 by 9 output is reduced to 3 by 3. 


Finally, we reach Layer 3. Here we'll apply 2 new filters, again each 3
by 3. One will look for a face looking forward (we'll call that F), and
the other for a face in profile to the right (we'll call that P). Our profile
image would look pretty weird, but that’s okay for this demonstration.


Let’s see these filters in action. Figure 21.20 shows the H, V, and B fil- 
ters running over the original image. Since we’re convolving our filters 
with the image, we'll move the center of each filter so it’s over each 
pixel in the input, and determine if there’s a match (that is, the white 
and black pixels in the filter match the white and black pixels in the 
image). Each pixel that matches will get outlined in red.


Figure 21.20: Applying filters H, V, and B to our original 27 by 27 image.
Top row: The filters we're using. Middle row: Each pixel that matches the 
filter is highlighted in red. We also show the 3 by 3 blocks that are used 
for max pooling. Any block with a red cell has been highlighted with a 
thicker outline. Bottom row: The result of the pooling operation. 3 by 3 
cells that were highlighted in the middle row are marked in black. 


After we’ve found our matches, we'll use max pooling with 3 by 3
blocks to reduce the size of each filter’s output from 27 by 27 to 9 by 9.
Any block that has at least one match inside becomes black. Note that 
some blocks have more than one match, such as near the middle-left 
of the V filter. We don’t do anything special for those blocks, but just 
mark them in black like any other block that contains a match. 
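
This pooling step might be sketched as follows. The function name and the positions of the three matches are made up. The sketch assumes a match map that holds 1 where the filter matched and 0 elsewhere, with dimensions that are multiples of the block size.

```python
import numpy as np

def max_pool(grid, block=3):
    """Reduce a grid by taking the maximum of each non-overlapping block.

    Assumes both dimensions of the grid are multiples of `block`.
    """
    rows, cols = grid.shape
    blocks = grid.reshape(rows // block, block, cols // block, block)
    return blocks.max(axis=(1, 3))

# Made-up match map: 1 where a filter matched, 0 elsewhere.
matches = np.zeros((27, 27), dtype=int)
matches[4, 10] = 1        # one match in block (1, 3)
matches[12, 0] = 1        # two matches in the same block (4, 0)
matches[13, 2] = 1

pooled = max_pool(matches)
print(pooled.shape, int(pooled.sum()))
```

Taking the maximum means that a block with two matches is marked exactly like a block with one, so the three matches above produce only two marked cells in the 9 by 9 result.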


The results for the L and R filters are shown in Figure 21.21.


Figure 21.21: Applying the L and R filters to our input face, using the
same conventions as in Figure 21.20. 


Now it’s time for filter level 2. In Figure 21.22, we’ve combined all 5 
of the output images into one diagram where each cell is marked 
with one or more letters, so the filter outputs are easier to take in as a 
group. In our simple example, most blocks had only one match with 
any of the filters, though the block just to the lower right of the center 
was matched by both the L and R filters. We apply our eye, nose, and 
mouth filters E, N, and M just as before, but this time we won’t demand 
that all the values match. Recall that an X in a cell means “don’t care”, 
which we can think of as a kind of wild card that matches everything. 
For example, the upper-right 3 by 3 block in the composite diagram 
almost matches the E filter, but there’s an extra R in the bottom right. 
Since that cell has an X in the corresponding entry in the filter, that 
block still matches when the E filter is centered over the B.


Figure 21.22: Top row: Our three second-layer filters. Second row: The
summary of the first layer. Each cell is marked with the filter that matched 
it, if any did. Third row: The E, N, and M filters are moved over the image 
in steps of 3. Cells they matched are shown in black. Bottom row: The 
output of each filter is a new image, now only 3 by 3. 
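
Matching with "don't care" cells can be sketched in a few lines. The layout of the E filter below is a guess based on the description in the text (an eyebrow made of L, H, and R above a block B), and the 3 by 3 block of first-layer results is invented. The figure's actual filter may differ.

```python
# Hypothetical layout of the eye filter E: an eyebrow (L, H, R) over a
# block (B). "X" means "don't care". The book's figure may differ.
EYE_FILTER = [["L", "H", "R"],
              ["X", "B", "X"],
              ["X", "X", "X"]]

def block_matches(filter_grid, cells):
    """Check a 3 by 3 block of first-layer results against a filter.

    `cells` holds, for each position, the set of first-layer filters
    that matched there (possibly empty).
    """
    for filter_row, cell_row in zip(filter_grid, cells):
        for wanted, found in zip(filter_row, cell_row):
            if wanted != "X" and wanted not in found:
                return False
    return True

# Made-up block resembling the upper-right corner of the composite
# diagram: a complete eye plus a stray R in the bottom right.
block = [[{"L"}, {"H"}, {"R"}],
         [set(), {"B"}, set()],
         [set(), set(), {"R"}]]

print(block_matches(EYE_FILTER, block))
```

The stray R sits under an X in the filter, so it is ignored and the block still counts as a match.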


As before, we'll apply max pooling using 3 by 3 blocks, so our output 
from this layer consists of 3 images, each 3 by 3. 


Now we're ready for the third and final layer, which tells us if the input 
image contained a forward-looking face, a face in profile, or neither. 
We just apply the two 3 by 3 filters to the results of the second layer, as 
in Figure 21.23. There’s no need to move the filters around, since the 
inputs are 3 by 3. Here, the forward looking face matched, while the 
profile did not.


Figure 21.23: Applying our third-level filters to the output of the second
level, using the same layout as in previous figures. The forward filter 
matches, while the profile filter does not. 


We've detected our face in the presence of all kinds of other distracting 
objects in the image! 


All we had to do was design our 10 little filters, and we could match 
our face, even when there was lots of other stuff in there to distract us. 
If we wanted to look for a different type of nose, we could just rede- 
sign the nose filter. Or we could look for many kinds of facial features 
at once using a filter for each type of eye, nose, and mouth, and then 
make higher-level filters that match various combinations to help us 
tell which kind of face we’ve been given. 


If we had to design these filters by hand for every project this wouldn’t 
be a very attractive algorithm. But a deep learning system can auto- 
matically learn the best values for the filters from the inputs after it’s 
been exposed to many labeled examples during training. 


Our example used a binary image so that it was easier to see what was 
happening, but in practice we often use grayscale and color images. 
In these cases, our filters will contain floating-point values. While the
black and white filters could only report a match or the lack of it, these
floating-point filters can return a floating-point number, where a larger
number means it made a better match with the data at that location.


A great thing about convolution is that the filters are able to find what 
they’re looking for anywhere in the image. For example, the H filter 
found all horizontal runs of 3 black pixels with white pixels above and 
below, everywhere in the image. When we're looking to process com- 
plex photographs of the natural world, this gives us the flexibility to 
detect image features robustly, even if they're not exactly where we 
expect them. 


There’s a sense in which our filters are getting more powerful as we 
work our way up the levels. Consider that a 3 by 3 filter on the second 
level is effectively responding to a 9 by 9 region, since each of its pix- 
els is the result of the previous step which reduced the image size. For 
example, our eye filter E is processing a 9 by 9 region, though it’s only 
3 by 3 itself. In this way, the filters at higher levels in a hierarchy are 
able to look for large and complex features, even though they use only 
small (and therefore fast) filters. 
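
To make the growth of each filter's reach concrete, here is a small Python
sketch. It is a rough count rather than anything taken from the figures: it
assumes every level uses a 3 by 3 filter and that each earlier level shrank
the image by a factor of 3, and it ignores any overlap between filter
placements. The function name effective_region is made up for this
illustration.

```python
def effective_region(filter_size, shrink_factors):
    """Side length, in original pixels, of the region one filter covers.

    shrink_factors lists how much each earlier level reduced the image.
    This is a simplified count that ignores overlapping placements.
    """
    pixels_per_cell = 1
    for factor in shrink_factors:
        pixels_per_cell *= factor
    return filter_size * pixels_per_cell


print(effective_region(3, []))      # first level: 3
print(effective_region(3, [3]))     # second level: 9
print(effective_region(3, [3, 3]))  # third level: 27
```

The filter itself never grows. Only the amount of the original image
summarized by each of its cells does.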


Higher levels are able to combine the results of lower levels in multi- 
ple ways. Suppose we want to classify a variety of different birds in a 
photo. Low-level filters might look for feathers or beaks, while higher 
filters are able to combine different types of feathers or beaks to recog- 
nize different species of birds, all in a single pass through a photo. 


This technique is sometimes referred to as working with a hierarchy 
of scales. 


21.3.4 Padding 


Let’s return to convolution and look at what happens near the edges of 
an input.


Suppose that we want to apply a 5 by 5 filter to a black-and-white
image. If we’re somewhere in the middle of the image, as in Figure 
21.24, then our job is easy. We pull out the 25 values from the image, 
scale them by the 25 values in the filter, and sum up the result. 


Figure 21.24: A five by five filter located somewhere in the middle of an 
image. The bright red pixel is the anchor, while the lighter ones make up 
the receptive field. 





But what if we’re right on an edge, as in Figure 21.25? 











Figure 21.25: Near the edge, the filter’s receptive field can fall off the 
side of the image. What values do we use for these missing pixels?


The footprint of the filter is hanging off the edge of the image. There
aren’t any input values there. How do we compute an output value for 
the filter when it’s missing some of its inputs? 


We have a few choices. One is to disallow this case, so we can only place 
the footprint where it is entirely within the input image, as we’ve been 
doing so far. Any pixels where we can’t place the filter get dropped 
from the output, making it smaller in each dimension. Figure 21.26 
shows this idea. 





Figure 21.26: We can avoid the “falling off the edge” problem by never 
letting our filter get that far. With a 5 by 5 filter, we can only center the 
filter over the pixels marked here in blue. The resulting 6 by 6 output, 
shown on the right, is smaller than the 10 by 10 grid we started with. 


While simple, this is a lousy solution, because if we apply multiple 
convolution filters to the same image, it will shrink on every step. We 
could end up with just a very small piece of the image as our input 
which isn’t a good result. 
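
The shrinking is easy to tally. The sketch below is only an illustration
(the helper name valid_output_size is an assumption, not a library call). It
applies the "stay entirely inside the image" rule twice to the 10 by 10 grid
with a 5 by 5 filter.

```python
def valid_output_size(input_size, filter_size):
    """Output side length when the filter must stay entirely inside the input."""
    return input_size - filter_size + 1


size = 10
sizes = [size]
for _ in range(2):
    size = valid_output_size(size, 5)
    sizes.append(size)

print(sizes)  # [10, 6, 2]
```

After just two unpadded 5 by 5 convolutions, only a 2 by 2 result is left.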




A popular alternative is to use padding. The idea is that we add a 
border of “extra” pixels around the outside of the image, as in Figure 
21.27. All of these pixels have the same value. By far the most common
choice is simply zero. This choice is called zero-padding.


Figure 21.27: A better way to solve the “falling off the edge” problem is to
add padding, or extra pixels, around the border of the image. Here we've
added a 2-pixel border, so every pixel in the original image (shown in
white) can be used as the center of the filter. Usually, the padded pixels
are given the value 0.





The size of the border depends on the size of the filter. We usually use just
enough padding so that the filter can be centered on every pixel in the 
original image. Libraries might offer automatic calculation of the pad- 
ding size based on the filter size, or they might require us to specify it 
manually. Normally we never explicitly place padding into our input 
images. Instead, we leave it to the library to create (or presume) these 
pixels when they’re needed. 


Using padding, we can create an output image of the same size as the 
input. 
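
Here is a minimal pure-Python sketch of zero-padding, assuming a 10 by 10
image of 1's and a 5 by 5 filter of 1's. The helpers pad_zeros and convolve
are hypothetical names written for this example. A real library would do
this work for us, as described above.

```python
def pad_zeros(image, ring):
    """Surround a 2D list with `ring` rows and columns of zeros."""
    width = len(image[0]) + 2 * ring
    padded = [[0] * width for _ in range(ring)]
    for row in image:
        padded.append([0] * ring + list(row) + [0] * ring)
    padded.extend([0] * width for _ in range(ring))
    return padded


def convolve(image, kernel):
    """Slide the kernel over every position where it fits entirely."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


image = [[1] * 10 for _ in range(10)]
kernel = [[1] * 5 for _ in range(5)]
ring = (5 - 1) // 2  # 2 rings of zeros for a 5 by 5 filter

unpadded = convolve(image, kernel)
padded = convolve(pad_zeros(image, ring), kernel)

print(len(unpadded), len(unpadded[0]))  # 6 6
print(len(padded), len(padded[0]))      # 10 10
print(padded[0][0], padded[5][5])       # 9 25
```

The corner output is 9 rather than 25 because only 9 of the 25 cells under
the filter land on real pixels there. The rest land on padded zeros.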


21.3.5 Stride 


When we sweep a filter over an image, we can imagine it moving the 
same way we read a book. For the moment, let’s assume we're using 
padding. The filter will start in the upper-left pixel of the input image, 
produce an output, then take one step right, produce an output, move 
another step right, and so on until it reaches the right edge of that line. 
Then it moves down one line and back to the left side, and the process 
repeats.


But we don’t have to be so methodical. Suppose we move, or stride,
more than one pixel to the right, or more than one pixel down, as we 
sweep our filter? Then our output will end up being smaller than the 
input. 


We'll see that there are a few different ways to use striding, all of which 
are important when we use convolution layers. 


To visualize striding, let’s think of our output as initially a blank slate. 
As the filter moves left to right, it produces a series of outputs, and 
those get placed one after the other, also left to right, in the output. 
When the filter moves down, the new outputs it produces go on a new
line of cells in the output. 


If we move, or stride, 1 pixel in each direction, as we’ve been doing 
so far, we get the results shown in Figure 21.28. Here we've left off 
padding. 





Figure 21.28: Sweeping a 3 by 3 filter over a 5 by 5 image, without padding.
Each step of the filter moves it to the right by one pixel in the input, and
then it moves down by one pixel. The diagram shows 4 of the 9 filter
positions that would make up this output.


But we could skip over pixels horizontally, or vertically, or both. For
instance, we might move to the right by 3 pixels on each horizontal 
step, and then move down by 2 lines on each vertical step. The output 
pixels are still assembled as before. The result is a new image that is 
one-third the size of the original horizontally, and one-half the size 
vertically. This is shown in Figure 21.29. 





























































































































Figure 21.29: Our input scanning can skip over pixels. In this example, 
we look only at every third pixel horizontally, and every other pixel verti- 
cally. In other words, we use a stride of 3 horizontally and 2 vertically. 
For clarity, we’re not using padding in this figure. The 5 by 9 input image 
turns into a 2 by 3 output. 
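
The anchor locations for a strided sweep can be listed with a few lines of
Python. This is a sketch under stated assumptions: no padding, an odd-sized
square filter, and (row, column) coordinates counted from 0 at the upper
left. The function name anchor_positions is invented for the example.

```python
def anchor_positions(height, width, filter_size, stride_down, stride_right):
    """Centers visited by an unpadded, odd-sized filter, in reading order."""
    half = filter_size // 2
    rows = range(half, height - half, stride_down)
    cols = range(half, width - half, stride_right)
    return [[(r, c) for c in cols] for r in rows]


# The 5 by 9 input with a 3 by 3 filter, stride 2 down and 3 to the right.
grid = anchor_positions(5, 9, 3, stride_down=2, stride_right=3)
print(len(grid), len(grid[0]))  # 2 3
print(grid)  # [[(1, 1), (1, 4), (1, 7)], [(3, 1), (3, 4), (3, 7)]]

# A stride of 1 on a 5 by 5 input gives the 9 positions of Figure 21.28.
dense = anchor_positions(5, 5, 3, stride_down=1, stride_right=1)
print(len(dense) * len(dense[0]))  # 9
```

Each anchor produces one output value, so the shape of this list is the
shape of the output.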


Another look at the pixels that served as locations for the filter’s anchor 
in Figure 21.29 is shown in Figure 21.30.


Figure 21.30: The six pixels where the filter was anchored in the
convolution of Figure 21.29.


We'll see later that this is a fast way to reduce the size of the input 
image, in order to speed up later blocks in the network. 


In Figure 21.29 we used a stride of 3 horizontally, and a stride of 2
vertically. More often we specify a single stride value for both axes. The
stride can be any positive integer, starting with 1. The default stride, as
shown in Figure 21.28, is 1, meaning that we step 1 pixel on each move,
and none are missed. A stride of 2 on both axes can be thought of as
taking every other pixel both horizontally and vertically, and similar
thinking holds for a stride of 3 or more. Pictures of these two sets of
strides are shown in Figure 21.31.


Figure 21.31: Examples of striding. (a) A stride of 2 in both directions
means evaluating every other pixel, both horizontally and vertically. (b) A
stride of 3 in both directions means evaluating every third pixel.


We can use striding to prevent filters from overlapping, which is useful 
when we want to shrink an image. For instance, if we’re moving a 3 by 
3 filter over an image, we might use a stride of 3 so that no pixel gets 
used more than once, as in Figure 21.32.


Figure 21.32: Using a stride of 3 in each dimension with a filter of size 3
by 3 means that each pixel in the input will be used only once. Read the 
figure top-down, left-to-right. Pixels shaded in gray have already been 
covered by the filter. The output image will be one-third the size of the 
original in each dimension. 
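
We can check the "used only once" claim by counting how many filter
placements touch each pixel. The sketch below assumes a 9 by 9 image with no
padding. The function name usage_counts is hypothetical.

```python
def usage_counts(size, filter_size, stride):
    """How many filter placements cover each pixel of a size-by-size image."""
    counts = [[0] * size for _ in range(size)]
    for r in range(0, size - filter_size + 1, stride):
        for c in range(0, size - filter_size + 1, stride):
            for i in range(filter_size):
                for j in range(filter_size):
                    counts[r + i][c + j] += 1
    return counts


overlapping = usage_counts(9, 3, stride=1)
tiled = usage_counts(9, 3, stride=3)

print(max(max(row) for row in overlapping))  # 9
print({n for row in tiled for n in row})     # {1}
```

With a stride of 1, a pixel in the middle of the image is read by 9
different placements. With a stride of 3, every pixel is read exactly once.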


21.4 High-Dimensional Convolution 


In our examples above we've been working with black-and-white 
images. That is, each image has only one channel of color information. 
We know that color images have at least 3 channels, most usually rep- 
resenting the red, green, and blue components of each pixel. Let’s see 
how to handle those. Once we can work with images with 3 layers, we'll 
also know how to work with tensors of any number of layers, such as 
the outputs of previous convolution layers. 


One way to handle a color image is to apply the same filter to each 
channel. Alternatively, we could make a filter that applies different 
weights to each channel.


It’s easy to make such a filter. We turn our previous filter, which was
just a grid of weights, into a stack of filters, one layer for each channel. 
Figure 21.33 shows the idea. In other words, our kernel moves from a 
2D grid to a 3D block that contains 27 values. 





Figure 21.33: When we apply a filter to a three-channel image, such as 
a color picture made up of red, green, and blue values at each pixel, we 
can apply three different filters, one for each channel. We can picture the 
three filters as a little stack, shown here for filters with a 3 by 3 receptive 
field. 


To apply this kernel to a 3-channel color image, we proceed much as 
before, but now we think in terms of blocks (or tensors of 3 dimen- 
sions) rather than grids (or tensors of 2 dimensions). 


Returning to our earlier idea of a “core sample,” let’s suppose our filter 
has a 3 by 3 footprint, and we’re going to process an RGB image with 
3 color channels. So we pull out a “core sample” from the RGB image, 
but now it’s a volume that is 3 pixels on each side (3 for height and 
width, because the filter is 3 by 3, and 3 for depth, because there are 3 
channels). 


Then we multiply each element of the block with the corresponding 
element in the block representing the filter.


If we want to produce a single value representing how closely the RGB
image matched the tensor of the kernel, we could add up all 27 mul- 
tiplied values and produce a single value for a new output image with 
just one channel, as in Figure 21.34. 


[Figure 21.34 labels: blue pixels x blue slice of filter, green pixels x
green slice of filter, red pixels x red slice of filter]


Figure 21.34: Convolving an RGB image with a 3 by 3 by 3 kernel. We pull 
out the 9 red, green, and blue pixel values for the 3 by 3 footprint, and 
multiply those elements with the corresponding slice of the kernel. We 
can add up all the results to produce a single value. 
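
As a sketch of this computation, the Python below evaluates a 3 by 3 by 3
kernel at one location of a tiny 3-channel image. The image values, the
kernel weights, and the name match_at are all made up for the example. The
data is stored as nested lists indexed [channel][row][column].

```python
def match_at(image, kernel, row, col):
    """Sum of products between a kernel block and the image block under it.

    image[channel][row][col] and kernel[channel][i][j] are nested lists.
    (row, col) is the upper-left corner of the footprint.
    """
    total = 0.0
    for ch, kernel_slice in enumerate(kernel):
        for i, kernel_row in enumerate(kernel_slice):
            for j, weight in enumerate(kernel_row):
                total += image[ch][row + i][col + j] * weight
    return total


# A 3-channel, 4 by 4 image: red is all 1.0, green all 0.5, blue all 0.0.
image = [[[value] * 4 for _ in range(4)] for value in (1.0, 0.5, 0.0)]
# A 3 by 3 by 3 kernel (27 weights): red slice all 1, green all 2, blue all 3.
kernel = [[[w] * 3 for _ in range(3)] for w in (1.0, 2.0, 3.0)]

print(match_at(image, kernel, 0, 0))  # 18.0
```

The red slice contributes 9, the green slice contributes 9, and the blue
slice contributes 0, giving the single output value 18.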


Let’s suppose that we wanted to produce an RGB image as output, but 
with each color channel modified by its own filter. Then instead of using 
one 3D kernel, we can use three separate 2D grids. Then each filter is 
applied to its own channel, and produces its own output. The result is 
a new tensor with 3 channels, one from each filter. Figure 21.35 shows 
this approach.


Figure 21.35: Here’s how we could apply 3 different kernels to an RGB
image to create an output RGB image. 


These ideas can be extended to images with any number of channels, 
as we'll see in the next section. 


We might choose to increase the efficiency of our implementation by 
applying many identical filter kernels simultaneously using parallel 
hardware. In that case, we would use the same weight sharing we saw 
before, meaning that all the filters still get their weights from a single 
source.


21.4.1 Filters with Multiple Channels


We can generalize the ideas above to use multiple filters on a single 
input. 


For example, suppose we have a black-and-white image, and we want 
to look for eyeballs, baseballs, volleyballs, and meatballs. We could 
create one filter for each of these features, and run each one over the 
input independently. The result would be four output images, each one 
channel deep, one from each filter. Figure 21.36 shows the idea. 





Figure 21.36: We can run multiple filters (in color) over the same input 
(in gray). Each filter creates its own channel in the output. 


So instead of producing a grayscale image with one channel, or a color 
image with three channels, we now have an image with four channels. 
If we used 7 filters, then the output would be a new image with 7 chan- 
nels. At that point we’d probably want to stop calling it an image and 
refer to it more generally as a tensor. 


The key thing to note here is that each filter has 1 slice, matching the 
input. If the input was 2 pixels deep, each filter would need to also be 
2 pixels deep.


Generally speaking, our filters can have any footprint, and we can apply
as many of them as we like to any input image. What’s most import- 
ant is that the number of channels in the filter matches the number of 
channels in the input. 


Figure 21.37 shows this idea. The input tensor at the far left has 6
channels. We’re applying four different filters, each with a 3 by 3 footprint,
so each filter is a tensor of size 3 by 3 by 6. The output of each filter
goes into its own output, which is a single channel deep. Since we’re
applying 4 filters, the output tensor is 4 channels deep.











































Figure 21.37: When we convolve filters with an input, each filter must 
have as many slices as the input. Here the input is 6 channels deep, so 
each filter is 6 channels deep. The 4 filters each create an output of 1 
channel, so the final output has 4 channels. 


Although in principle each filter we apply can have a different footprint, 
in practice we almost always use the same footprint for every filter in 
any given convolution layer. For example, in Figure 21.37 all the filters 
had a footprint of 3 by 3. If another convolution layer follows this one, 
the filters on that new layer could have a footprint of any size. 


We'll see below how convolution layers manage all of this accounting 
for us. Jumping ahead a little, to add the entire operation of Figure 
21.37 to our network, all we need to specify is that we want 4 filters, 
each with a footprint of 3 by 3 and a depth of 6. If the library can
automatically match the input tensor’s depth, it will make us 4 filters
of size 3 by 3 by 6, and initialize them with random values. During the 
forward pass the filters will all be applied and the results combined 
into the 4-channel output. During backpropagation, the 54 values 
(3x3x6) in each filter will get adjusted to improve the error at the net- 
work’s output. 
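
The accounting can be sketched in plain Python. This is only an
illustration of the shapes involved, not how any particular library does it:
the names make_filters and apply_filters are invented, the 8 by 8 input size
is arbitrary, and the random range is just a placeholder for whatever
initializer a real library would use.

```python
import random


def make_filters(count, footprint, depth, seed=0):
    """Create `count` filters, each indexed [channel][row][col]."""
    rng = random.Random(seed)
    return [[[[rng.uniform(-0.01, 0.01) for _ in range(footprint)]
              for _ in range(footprint)]
             for _ in range(depth)]
            for _ in range(count)]


def apply_filters(tensor, filters):
    """Unpadded, stride-1 convolution; returns one output channel per filter."""
    depth, height, width = len(tensor), len(tensor[0]), len(tensor[0][0])
    size = len(filters[0][0])
    output = []
    for f in filters:
        assert len(f) == depth, "filter depth must match input depth"
        channel = [[sum(tensor[d][r + i][c + j] * f[d][i][j]
                        for d in range(depth)
                        for i in range(size) for j in range(size))
                    for c in range(width - size + 1)]
                   for r in range(height - size + 1)]
        output.append(channel)
    return output


tensor = [[[1.0] * 8 for _ in range(8)] for _ in range(6)]  # 6 channels, 8 by 8
filters = make_filters(count=4, footprint=3, depth=6)
output = apply_filters(tensor, filters)

weights_per_filter = sum(len(row) for s in filters[0] for row in s)
print(weights_per_filter)                              # 54
print(len(output), len(output[0]), len(output[0][0]))  # 4 6 6
```

Four filters give four output channels, and each filter carries the 54
weights that training would adjust.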


21.4.2 Striding for Hierarchies


We saw earlier that using a series of convolutions on an image of 
ever-decreasing size lets us efficiently look for objects made up of lots 
of elements. Striding makes this easy, because it inherently allows us 
to output an image that’s smaller than the input. For example, if the 
stride is 2 units in some dimension, the output will be 1/2 the size in 
that dimension. If the stride is 3 units, the output in that dimension 
will be 1/3 the size, and so on. 


Let’s suppose we start with a 600 by 600 black-and-white image, as in
Figure 21.38. Our first convolution layer will apply 8 filters, each of size
5 by 5, to this 600 by 600 image. We'll pad the image with 2 rings of
0’s all around so the filters can be centered on every pixel, but we'll
use a stride of 2. This means we'll get back an image of half size, or 300
by 300, with 8 channels.


[Figure 21.38 diagram labels: Convolution 2D, 8 x (5x5), stride = (2,2),
padding = (2,2), output 300 by 300; then Convolution 2D, 4 x (3x3),
stride = (3,3), padding = (1,1).]














Figure 21.38: Creating a hierarchy of convolutions. The first convolution 
works on the full 600 by 600 image. The second works on a 300 by 300 
version of that image that contains the results of the first stage’s 8 filters.
Each stage uses a lower-resolution version of the previous stage, so it can 
work with larger collections of features without requiring larger filters.


Now we'll run 4 new filters over this tensor, each filter 3 by 3, with 8
channels to match the 8 channels of its input. We'll use a stride of 3 and
1 ring of padding, as in Figure 21.38, getting back a tensor of size 100 by
100 by 4.
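
These sizes can be checked with the usual output-size arithmetic for a
padded, strided convolution. The sketch below uses the stride and padding
values from the labels in Figure 21.38 (padding of 2 for the first layer and
1 for the second). The function name conv_output_size is our own.

```python
def conv_output_size(input_size, filter_size, padding, stride):
    """Output side length for a padded, strided convolution."""
    return (input_size + 2 * padding - filter_size) // stride + 1


first = conv_output_size(600, filter_size=5, padding=2, stride=2)
second = conv_output_size(first, filter_size=3, padding=1, stride=3)
print(first, second)  # 300 100
```

The number of channels at each stage comes from the number of filters, not
from this formula: 8 after the first layer and 4 after the second.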


Because the image has been scaled down at each step, each succeeding 
set of filters is working with a lower-resolution version of the original. 
This means they run faster, and can look for larger features, since a 3 
by 3 filter on a 300 by 300 image can be thought of roughly as a 6 by 6 
filter on the previous layer’s 600 by 600 image. 


Conceptually, one of the filters on our first convolution might be 
implementing the toad’s hypothetical “looking for light-colored blobs,” 
while another filter is looking for “things with wings.” Then the next 
convolution can look at these two results simultaneously, and look for 
“blobs that have wings.” 


We can keep up this shrinking until we end up with a tensor of just 
one pixel on a side, though that’s rare in practice. The only rule is, as 
we saw above, that the filters at each step must have one slice for each 
channel in the image that they work on. 


In this example, we’ve reduced the resolution of our image, sometimes 
called downsampling, using striding during convolution. An alter- 
native is to do no striding (that is, move just 1 pixel horizontally and 
vertically as we move the filters), and then follow the operation with a
pooling layer. In recent years, experience has shown that doing the 
downsampling with striding while convolving is faster and often gives 
results that are just as good, or better, so it’s become the more common 
idiom [Springenberg15]. Nevertheless, pooling was used by many pop- 
ular architectures that are still in common use today, so it’s important 
to be familiar with the technique. 


We'll see that we often run this process in reverse as well, increasing 
the number of pixels in an image, say from 100 on a side to 300 on a 
side. We can do this process, called upsampling, with an upsampling 
layer, as we discussed in Chapter 20. But just as with downsampling, 
we can do this increase in resolution while computing the convolution 
itself. We'll see how that’s done later in this chapter.


21.5 1D Convolution


So far we’ve discussed moving our kernel in two dimensions, both
horizontally and vertically, across our 3D input image (height, width,
and one or more channels). As we’ve seen, a
grayscale image has 1 channel, an RGB image has 3 channels, a CMYK 
image has 4 channels, and other color representations may have other 
numbers of channels. Recall that we don’t move the filter in depth, 
because the filter itself has as many channels as the input image. 


What if the input isn’t 3D? We can generalize our discussion to both
smaller and larger tensors, which involves moving our filter through
fewer and more dimensions, respectively.


An interesting special case is called 1D convolution. This involves 
moving the filter in only one direction [Snavely13]. This is a popular 
technique when working with text, where each row represents a single 
word, or a fixed number of letters [Britz15]. 


The basic idea is shown in Figure 21.39. We create one or more fil- 
ters that are the entire width of an input matrix, where we can place 
one word of our text in each row. Once we’ve computed an output for 
each filter, we move it one row downwards. The name “1D convolution” 
comes from this single direction, or dimension, of movement. 

























































































Figure 21.39: In 1D convolution, we create a filter with the full width of 
the grid, and sweep it downwards one row at a time. The name comes 
from moving the filter in only one direction (or dimension).


As we saw before, we can have multiple filters sliding down the grid,
each of a different height. 


Any time we move our filter in just one dimension, we can call that “1D 
convolution.” It’s easiest to see with a 2D grid, but we can do the same 
thing with an input of any number of dimensions, as long as we only 
slide the filter along the input in one dimension. 
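
A minimal sketch of this idea in Python follows. The grid of numbers stands
in for five "words" of four values each, and both the data and the name
conv1d_rows are invented for the illustration. The filter spans the full
width and is 2 rows tall, so it can only move downward.

```python
def conv1d_rows(grid, kernel):
    """Slide a full-width kernel down the rows of a 2D grid."""
    height = len(kernel)
    assert len(kernel[0]) == len(grid[0]), "kernel must span the full width"
    return [sum(grid[r + i][j] * kernel[i][j]
                for i in range(height) for j in range(len(kernel[0])))
            for r in range(len(grid) - height + 1)]


# Five "words", each a row of 4 made-up numbers.
grid = [[1, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 1],
        [1, 1, 1, 1]]
kernel = [[1, 1, 1, 1],
          [1, 1, 1, 1]]  # 2 rows tall, full width

print(conv1d_rows(grid, kernel))  # [2, 2, 2, 5]
```

The output is a single column of values, one for each vertical position of
the filter.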


Because of its name, 1D convolution can be easily confused with con- 
volution with a 1 by 1 filter, often called 1 by 1 convolution, or 1x1 
convolution. These two ideas are very different. Let’s look at 1x1 
convolution now. 


21.6 1x1 Convolutions


We’ve seen how to use multiple filters to create multiple outputs. And 
then future steps of convolution can use new filters that take the out- 
put of those filters as input. 


But what if we just want to combine the filter outputs in some way, 
without using a big footprint? 


For example, we might have one filter that spots red circles, and 
another that finds green lines, and we'd like to find those pixels with 
both a red circle and a green line (such pixels would be yellow). 


We can build a 1 by 1 filter, often written as a 1x1 filter, and use that
to perform 1 by 1 convolution [Lin14]. 


This is a filter with a footprint of just one pixel. It’s normal convolution, 
in that we sweep this filter over the input tensor, multiply the values in 
the input by the weights in the filter, produce a result, and save that in 
a new output tensor. The only difference is that the filter’s footprint is 
just a single pixel in both dimensions. Figure 21.40 shows this visually.


Figure 21.40: 1x1 convolution sweeps a filter with a footprint of 1x1 pixel.


What’s the point of such a tiny filter? One powerful application of 1x1 
convolution is to do feature reduction on the fly. This is a familiar 
idea: in Chapter 12 we saw how to use algorithms like PCA to pre-pro- 
cess our data to reduce the number of features, and thus improve the 
performance of our algorithms. In this case, our 1 by 1 filter is
specifying a way to project all of the values at an element down onto a
line, where we need only a single value to represent it.


The value of such an operation is two-fold. First, there’s less data to 
process, so further calculations go faster and consume less memory. 
Second, the network is often able to produce better results because it 
can direct all of its computational power on useful information, rather 
than wasting it on redundant features. 


Let’s see how 1x1 convolution can pull off this trick for us. Suppose 
we start with a tensor that has 300 layers, as in Figure 21.41. We sus- 
pect that lots of the data in those layers is redundant, and we think we 
could get good results with just 175 layers.


Figure 21.41: Applying 1x1 convolution to perform feature reduction.


We can make 175 filters, each with a 1x1 footprint. Each filter will look
at all 300 values located at one pixel, and produce just a single value as
a result. With training, each of these filters can extract one useful
measurement from those 300 incoming values. The result is a tensor with
only 175 layers, so all following operations will go nearly twice as fast.
If our guess of 175 was right, then we could still get acceptable results
from our network, but with less time and memory consumption.
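
Here is the same idea at toy scale, as a hedged Python sketch: 4 input
layers are reduced to 2, rather than 300 to 175. The weights are chosen by
hand so the arithmetic is easy to follow. In a real network they would be
learned. The name conv1x1 is our own.

```python
def conv1x1(tensor, filters):
    """Apply 1x1 filters; tensor[channel][row][col], filters[k][channel]."""
    depth, height, width = len(tensor), len(tensor[0]), len(tensor[0][0])
    return [[[sum(tensor[d][r][c] * weights[d] for d in range(depth))
              for c in range(width)]
             for r in range(height)]
            for weights in filters]


# 4 input channels on a 2 by 2 grid; channel d holds the constant d + 1.
tensor = [[[d + 1] * 2 for _ in range(2)] for d in range(4)]
# 2 filters, each with one weight per input channel.
filters = [[1, 1, 0, 0],   # adds the first two channels
           [0, 0, 1, -1]]  # subtracts the fourth channel from the third

reduced = conv1x1(tensor, filters)
print(len(tensor), "->", len(reduced))     # 4 -> 2
print(reduced[0][0][0], reduced[1][0][0])  # 3 -1
```

The height and width are untouched. Only the number of layers changes, from
one per input channel to one per filter.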


This process often works well in practice when the features are
correlated [Canziani16]. That means that the filters on the previous
layers have created results that are in sync with one another, so that
when one goes up, we can predict by how much the others will go up
or down. The better this correlation, the more likely it is that we can
remove some of the correlated layers and suffer little to no loss of
information. The 1 by 1 filters are perfect for this job.


Inserting layers that perform 1 by 1 convolutions into our convnets of 
many layers can improve their performance, as demonstrated by the 
architecture given the colorful name “Inception” [Szegedy14].


21.7 A Convolution Layer


We've talked a lot about convolution layers without saying much 
about how they work. Let’s address this now. 


When we create a convolution layer, we typically tell the library 
how many filters we want, what their footprint should be, and other 
optional details like whether we want to use striding and what acti- 
vation function we want to use, and the library takes care of all the 
rest. Most importantly, it improves the kernel values in each filter with 
backpropagation, learning the best values that make filters that pro- 
duce the best results. 


The output of a convolution layer is a tensor, with one slice for each
filter, as we’ve seen above. Because each filter is designed to match a
feature in the input, the output of a convolution layer is sometimes
called a feature map. The word “map” comes from its mathematical
meaning. Here we can think of this “map” as telling us where each
of the features is in its input, with larger values indicating increased
likelihoods.


When we draw a diagram of our model, we usually identify our con- 
volution layers by how many filters are used, their footprints, and the 
activation function. Usually the default is to apply no padding and use 
a striding of (1,1), so if we want different values for these options we 
include them explicitly. Since it’s common to use the same padding all 
around the input, we often just provide a single value rather than two, 
with the understanding that it applies to all dimensions. So a padding 
of 3 would stand for (3,3) in 2D, or (3,3,3) in 3D. 


Some libraries will automatically compute the amount of padding 
that’s needed in order to keep the output the same size as the input, so 
all we have to do is ask for padding, rather than tell it explicitly how 
much to use. Figure 21.42 shows our shorthand icon for two convolution
layers, along with the traditional box-and-text versions.


Figure 21.42: Two convolution layers. Each is shown in our schematic form,
and traditional box-and-label form. (a) A convolution layer with 5 filters,
each 3 by 3, and a ReLU activation function. The stride is implicitly (1,1),
and there’s no padding. (b) The same layer as in part (a), but now with an 
explicit stride of (3,3), and a single ring of zero-padding around the input. 


21.7.1 Initializing the Filter Weights


When we create a new convolution layer, all the necessary filters are 
also created. We haven’t learned anything yet, so we don’t know what 
weights should be in the filter kernels. But just like the weights for reg- 
ular artificial neurons, they have to be initialized with something. 


If two filters have identical weights, we’re wasting resources, since 
both would do the same job. We say that two such filters are symmet- 
rical. We want to prevent such filters from forming, so we definitely 
don’t want to initialize two filters with the same values. For this rea- 
son, assigning different values to every filter is called symmetry 
breaking. 


One approach to initialization with symmetry breaking is to use small 
random numbers, such as those from the range [-0.01,0.01]. This 
makes it unlikely that any filter will exactly duplicate any other filter. 


Research into initialization has led to other approaches. Two popular
techniques are both named for the lead author on the paper that
described them. These are Glorot initialization (also called Xavier
initialization) [Glorot10] and He initialization [He15]. Recall that
we saw these different initializers in Chapter 16.


Both of these techniques work on the same general idea: the initial
values are random numbers that are chosen according to the number of
inputs to the neuron. This is called the fan-in. Both Glorot and He 
choose random values from a distribution that uses the fan-in as a 
parameter [Jones15]. 


A nice feature of these approaches is that they take no other parame- 
ters, and in particular no user-specified parameters. We merely need 
to tell our library which initialization we want, and it takes care of the 
rest. Without going into the theory, the current recommendation is 
that if our library provides He initialization as an option, we should 
use it [Karpathy16]. Otherwise, Glorot or random values can be used 
instead. 
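
For the curious, here is a sketch of what such an initializer might compute,
written with Python's standard library only. It uses the commonly cited
forms of the two schemes: He draws from a normal distribution with standard
deviation sqrt(2 / fan-in), and the uniform variant of Glorot draws from a
range set by both the fan-in and the fan-out. Treating a filter's fan-in as
footprint times footprint times depth, and the fan-out as footprint times
footprint times the number of filters, is a convention we are assuming here.
The function names are our own.

```python
import math
import random


def fan_in(footprint, depth):
    """Number of inputs feeding one output value of a filter."""
    return footprint * footprint * depth


def he_normal(n_weights, n_in, rng):
    """He initialization: normal values with std sqrt(2 / fan_in)."""
    std = math.sqrt(2.0 / n_in)
    return [rng.gauss(0.0, std) for _ in range(n_weights)]


def glorot_uniform(n_weights, n_in, n_out, rng):
    """Glorot initialization: uniform within sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (n_in + n_out))
    return [rng.uniform(-limit, limit) for _ in range(n_weights)]


rng = random.Random(0)
n_in = fan_in(footprint=3, depth=6)  # 54 inputs per output value
n_out = 3 * 3 * 4                    # assumed fan-out for a layer of 4 filters
he = he_normal(n_in, n_in, rng)
glorot = glorot_uniform(n_in, n_in, n_out, rng)
print(n_in, len(he), len(glorot))  # 54 54 54
```

In practice we would simply name the initializer we want and let the
library produce these values for us.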


21.8 Transposed Convolution


The convolution layers we’ve looked at so far either maintain the size
of their input or make it smaller. But we can use the same technique to
also make the input tensor larger. This process is called upsampling.
When we do it inside of a convolution step, it’s called transposed
convolution or fractional striding. The word “transposed” comes
from the mathematical operation of transposition, which we can use to
write the equation for this operation. We'll see where “fractional
striding” comes from below.


Some authors call upsampling while convolving deconvolution, but 
that name is already taken by a different idea [Zeiler10]. To avoid con- 
fusion, most people avoid that term now, preferring either of the other 
terms above. In this chapter, we'll use the term transposed convolution. 


Let’s see how this works. We'll begin by revisiting basic convolution
without padding, creating an output smaller than the input. In Figure
21.43 we move a 3 by 3 filter over a 5 by 5 input, resulting in a 3 by 3
output.


Figure 21.43: Convolving a 5 by 5 input image with a 3 by 3 filter. Here we
aren't using any padding, so the filter cannot be centered on the outer- 
most ring of pixels. The outer shapes show the original image and the 
footprint of the 3 by 3 filter as it moves over the image. The central figure 
is the resulting 3 by 3 image. 


If we use a single ring of zero-padding around the original 5 by 5 image,
then using the same filter will give us a 5 by 5 image, as in Figure 21.44.


Figure 21.44: The same setup as in Figure 21.43, only now our input 5 by
5 image has been zero-padded by 1 element. We show several represen- 
tative placements of the 3 by 3 filter, and the element they produce. The 
output image is 5 by 5, so the input and output have the same sizes. 


If we use striding with our filters, then the output will again become 
smaller than the input. For instance, a 60 by 60 padded input with a 
stride of 3 in each direction will produce an output of size 20 by 20. 
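

The sizes in the last few paragraphs follow a simple piece of arithmetic, sketched below in Python. The helper name conv_output_size is our own, and for the 60 by 60 case we assume a 3 by 3 filter with one ring of zero-padding, since the text doesn't say which filter is used.

```python
def conv_output_size(n, k, pad=0, stride=1):
    """Side length of the output when a k-wide filter sweeps an n-wide input.

    `pad` is the number of rings of zeros added around the input.
    """
    return (n + 2 * pad - k) // stride + 1

print(conv_output_size(5, 3))                    # no padding: 3
print(conv_output_size(5, 3, pad=1))             # one ring of zeros: 5
print(conv_output_size(60, 3, pad=1, stride=3))  # assumed 3 by 3 filter: 20
```

The same function covers every case in this section, because padding only changes how wide the swept grid is, and striding only changes how many steps fit across it.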


Now let’s produce larger outputs than the inputs we started with [Dumoulin16]. 
Suppose that we have a starting image of dimensions 3 by 3, and we’d 
like to process it with a 3 by 3 filter. We'd like to end up with a 5 by 5 
image. All we have to do is pad, or surround, the input with two rings 
of 0’s, as in Figure 21.45. 


Figure 21.45: Using transposed convolution, our output image can be 
larger than our input. Our original 3 by 3 input is shown in white in the 
outer grids, padded with two elements of 0’s all around. The 3 by 3 filter 
now produces a 5 by 5 result, shown in the center. 


We could make the output even larger with more rings of 0’s, but that 
will just produce rings of 0’s in the output. 


Another way to get a larger result is to spread out the input image, by 
inserting padding both around and between the input elements. This 
is called dilated convolution. Let’s insert a single row and column 
of 0’s between the elements of our starting 3 by 3 image, and pad all of 
that with 2 rings of 0’s. This turns our 3 by 3 input into a 9 by 9 grid. 
When we sweep our 3 by 3 filter over this grid, we'll get a 7 by 7 output. 
Thus we’ve enlarged our original 3 by 3 input into a 7 by 7 output, as 
shown in Figure 21.46. 


Figure 21.46: Dilated convolution. Our original 3 by 3 image is shown in 
the outer grids with white pixels. We've inserted a row and column of 0’s 
between each pixel, and then surrounded the whole thing with two rings 
of 0’s. When we convolve our 3 by 3 filter with this grid, we get a 7 by 7 
result, shown in the center. 


Let’s make our output even bigger by inserting two rows and columns 
between each original input element, as in Figure 21.47. Now our input 
is 11 by 11, and the output is 9 by 9. 


Figure 21.47: The same setup as Figure 21.46, only now we have two rows 
and columns between our original input pixels, producing the 9 by 9 
result in the center. 
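

To make the zero-insertion idea concrete, here is a minimal NumPy sketch of the three setups in Figures 21.45 through 21.47. The function names spread_and_pad and slide_filter, the input values, and the all-ones filter are our own choices for illustration. A real library would compute the same result far more efficiently.

```python
import numpy as np

def spread_and_pad(image, gap, rings):
    """Insert `gap` rows and columns of zeros between elements, then add `rings` of zeros."""
    h, w = image.shape
    step = gap + 1
    spread = np.zeros(((h - 1) * step + 1, (w - 1) * step + 1), dtype=image.dtype)
    spread[::step, ::step] = image          # drop the originals onto the sparse grid
    return np.pad(spread, rings)            # zero-padding on every side

def slide_filter(grid, kernel):
    """Sweep the kernel over the grid with a stride of 1 and no extra padding."""
    kh, kw = kernel.shape
    out_h = grid.shape[0] - kh + 1
    out_w = grid.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(grid[r:r + kh, c:c + kw] * kernel)
    return out

image = np.arange(1.0, 10.0).reshape(3, 3)  # made-up 3 by 3 input
kernel = np.ones((3, 3))                    # made-up 3 by 3 filter

for gap in (0, 1, 2):
    grid = spread_and_pad(image, gap, rings=2)
    result = slide_filter(grid, kernel)
    print(gap, grid.shape, result.shape)
# 0 (7, 7) (5, 5)
# 1 (9, 9) (7, 7)
# 2 (11, 11) (9, 9)
```

Each extra row and column of zeros between the input elements adds two to the width of the output, which is why the results grow from 5 to 7 to 9.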


We can choose as many rows and columns of 0’s as we like between 
our original elements, and as many rings of 0’s as we like around them. 
But we have to keep the size of our filter in mind. If we place, say, 3 
rows and columns of 0’s between each input pixel and we use a filter 
that is 3 elements wide, it will introduce a grid of vertical and 
horizontal lines of 0’s in the output. We rarely want this, so we usually 
keep the number of extra rows and columns to less than the size of the 
filter. In this case, using 2 rows and columns is as many as we'd want to 
apply before convolving with this 3 by 3 filter. This technique of 
inserting 0’s isn’t foolproof, and can create little checkerboard-like 
artifacts in the output tensors. But these can be avoided by library 
routines if they take steps to handle the convolution and upsampling 
carefully [Odena16]. 


There is a connection between transposed convolution and striding. 
With some imagination, we can describe a transposed convolution 
process like that of Figure 21.47 as using a stride of 1/3 in each 
dimension. We don’t mean that we literally move 1/3 of a pixel, but rather 
that we need to take 3 steps to move the equivalent of one step in the 
original input. This point of view leads some authors to refer to trans- 
posed convolution as fractional striding. 


Just as striding lets us combine convolution with a downsampling 
step, transposed convolution (or fractional striding) lets us combine 
convolution with an upsampling step. Downsampling and upsampling can 
each also be performed by a separate layer of the same name. Doing 
the downsampling or upsampling during convolution can produce 
slightly different results than using a separate layer, but experience has 
shown that it’s usually faster and more efficient to do them together, 
and it usually produces results that are just as good [Springenberg15]. 


21.9 An Example Convnet 


To demonstrate how we can use convolution layers in practice, let’s 
look at a couple of image classifiers. The first will identify grayscale 
handwritten digits, and the second will identify what object is featured 
in a color photograph, choosing from 1000 different categories. 


Categorizing handwritten digits is a famous problem in machine learn- 
ing [LeCun89]. We'll begin by categorizing the handwritten digits in 
the MNIST data set. 


The MNIST data set collects tens of thousands of handwritten digits 
from a wide variety of people. The digits are from 0 to 9, and each is 
saved as a grayscale image that’s 28 by 28. Our job is to identify the 
digit in each image. 


We'll use a simple convnet designed for this job that is included with 
the Keras machine learning library [Chollet17a]. The architecture is 
shown in Figure 21.48 in both our schematic form, and the traditional 
box-and-label form. 


Figure 21.48: A convnet for classifying MNIST digits. The input images 
are 28 by 28 by 1 channel. Two convolution layers are followed by pooling, 
dropout, and flatten, then a dense layer (with dropout) and a final softmax 
layer with 10 outputs. This network is due to [Chollet17a]. Top: Our sche- 
matic version. Bottom: Traditional box-and-label form. 


The input to the net is the MNIST image, with resolution 28 by 28 by 
1. The heart of the convnet is in the first two layers, which perform the 
convolutions. 


The first convolution layer runs 32 filters of size 3 by 3 over the input. 
Each filter’s output is run through a ReLU activation function before 
it leaves the layer. 


Keep in mind that we merely tell the system that we want 32 filters 
with a 3 by 3 footprint. It recognizes that the input has just 1 chan- 
nel, so it creates filters with dimensions 3 by 3 by 1. Since we don’t 
specify an initialization method, the library uses its default (in Keras, 
this is Glorot initialization). The stride is left at the default of 1 in each 
direction, and we apply no padding. As we saw above, this means that 
the outermost ring of pixels won’t make it to the output, so the output 
image will be 26 by 26, but that’s okay because all MNIST images are 
supposed to have a border of 4 black pixels around the digit (not all 
images have this border, but most do). 


So the first layer’s input tensor is 28 by 28 by 1, and the output tensor 
is 26 by 26 by 32. 


The second step is another convolution layer, this time with 64 filters 
with a 3 by 3 footprint. The system knows that the input has 32 chan- 
nels, so each filter is created with size 3 by 3 by 32. The input tensor 
is 26 by 26 by 32. Because we're using the default stride and padding 
here as well, we lose another ring around the outside of the image, 
producing an output tensor of shape 24 by 24 by 64. 


We could have used striding to reduce the size of the output, but here 
we use an explicit max pooling layer with blocks of size 2,2. That means 
for every non-overlapping 2 by 2 block in the input, the layer outputs 
just one value containing the largest value in the block. Thus the out- 
put of this layer is a tensor of size 12 by 12 by 64 (the pooling doesn’t 
change the number of channels). 
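

The pooling step just described can be sketched in a few lines of NumPy. This is only an illustration of the idea, not how Keras implements it. The function name max_pool_2x2 and the random input are our own.

```python
import numpy as np

def max_pool_2x2(tensor):
    """Keep the largest value in each non-overlapping 2 by 2 block.

    `tensor` has shape (height, width, channels); height and width must be even.
    """
    h, w, c = tensor.shape
    # Split each spatial axis into (block index, position inside the block).
    blocks = tensor.reshape(h // 2, 2, w // 2, 2, c)
    return blocks.max(axis=(1, 3))

features = np.random.default_rng(0).random((24, 24, 64))  # stand-in for the conv output
pooled = max_pool_2x2(features)
print(pooled.shape)   # (12, 12, 64)
```

Because the blocks don't overlap, every input value is looked at exactly once, and the channel count passes through untouched.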


Next comes a dropout layer. Since there are no neurons in the max 
pooling layer, this affects the most recent convolution layer, here the 
one with 64 filters. On each epoch, one-quarter of the neurons in this 
layer will be temporarily disabled. This should help hold off overfitting. 


The dropout layer is really just instructions to the code running the 
network, and it doesn’t perform any operations. So the tensor coming 
out of the dropout layer is the same as what went in, or 12 by 12 by 64. 


Now we leave the convolutional part of the network, and prepare the 
values for output. These steps, or something like them, are typically 
found at the end of a classification convnet. 


First we flatten the input tensor so it’s a big list of 12x12x64=9216 
numbers. That list goes into a fully-connected, or dense, layer of 128 
neurons. That layer also gets affected by dropout, where a quarter of 
the neurons are temporarily disconnected at the start of each epoch. 


The 128 outputs of this layer go into a final dense layer with 10 neu- 
rons. The 10 outputs of this layer go into a softmax step, so that they’re 
converted to probabilities. The 10 numbers that come out of this last 
layer give us the network’s prediction of the probability that the input 
image is the corresponding digit. 
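

As a check on the bookkeeping, the following sketch traces the tensor shapes through the layers described above, and then applies softmax to ten made-up scores. The helper names and the score values are hypothetical. The layer sizes are the ones given in the text.

```python
import numpy as np

def conv_shape(shape, filters, k=3):
    """Shape after an unpadded, stride-1 convolution with k by k filters."""
    height, width, _ = shape
    return (height - k + 1, width - k + 1, filters)

def pool_shape(shape, size=2):
    """Shape after non-overlapping max pooling; channels are unchanged."""
    height, width, channels = shape
    return (height // size, width // size, channels)

def softmax(scores):
    """Turn a list of scores into probabilities that add up to 1."""
    exps = np.exp(scores - np.max(scores))   # subtract the max for stability
    return exps / np.sum(exps)

shape = (28, 28, 1)                # one grayscale MNIST image
shape = conv_shape(shape, 32)      # first convolution layer
shape = conv_shape(shape, 64)      # second convolution layer
shape = pool_shape(shape)          # 2 by 2 max pooling
flat_length = shape[0] * shape[1] * shape[2]
print(shape, flat_length)          # (12, 12, 64) 9216

# Made-up scores from the final 10-neuron layer, one per digit.
scores = np.array([0.1, 0.3, 4.0, 0.2, 0.0, 0.5, 0.1, 1.2, 0.4, 0.3])
probabilities = softmax(scores)
print(probabilities.argmax())      # 2
```

The dropout layers don't appear in the trace because they leave the tensor shapes unchanged.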


We trained the network for 12 epochs using the standard MNIST 
training data. Its accuracy on the training and validation data sets is 
shown in Figure 21.49. 


[Plot: accuracy versus epoch, with one curve for training accuracy and 
one for validation accuracy.] 


Figure 21.49: The training performance of our convnet in Figure 21.48. 
We trained for 12 epochs, and since the training and validation curves 
are not diverging, we've successfully avoided overfitting, while reaching 
a tiny bit more than 99% accuracy on both data sets. 


Chapter 21: Convolutional Neural Networks 


The curves show we're at about 99% accuracy on both the training and 
validation data sets. Since the curves aren’t diverging, we’ve success- 
fully avoided overfitting. 


Let’s look at some predictions. Figure 21.50 shows some images from 
the MNIST validation set, and the digit corresponding to the network’s 
largest probability. On this little set of examples, it did a perfect job. 


Figure 21.50: These are 24 randomly-chosen images from the MNIST 
validation set. Each image is labeled with the output of the network, 
showing the digit with the highest probability. In fact, the probabilities 
for all of these images were almost 1 for the shown label, and nearly 0 for 
all others. The network classified all 24 of these digits correctly. 


This little convnet is doing most of its work in the two convolution lay- 
ers, and they each just ran little 3 by 3 filters over the image. But that 
was enough to enable the system to correctly identify 99% of the digits 
in the validation set. 


Thanks to how well this technique performs, convolution has become a 
staple technique for deep learning architectures that work with images 
and volumes, as well as the other applications we mentioned at the 
start of this chapter. 


In Chapters 23 and 24 we'll look at how to actually write code in the 
Keras library to build convnets. 


21.9.1 VGG16 


Let’s now look at a much bigger and more powerful convnet, capable 
of identifying 1000 different objects in color photographs. 


The ILSVRC2014 competition was a public challenge in 2014, asking 
people to build the best neural network they could for classifying a pro- 
vided database containing a huge variety of images [Russakovsky15]. 
The acronym ILSVRC stands for “ImageNet Large Scale Visual 
Recognition Challenge.” 


The training data contained 1.2 million images, each manually labeled 
with one of the 1000 objects that the network should be able to recog- 
nize. The challenge actually included several sub-challenges, each with 
its own winners [Imagenet14]. The winner of one of the classification 
tasks was a network called VGG16 [Simonyan14]. VGG is an acronym 
for the “Visual Geometry Group” who developed the system. The 16 
refers to the network’s 16 computational layers (there are also some 
utility layers for pooling and padding). 


The VGG16 system has become popular for working with image 
classifiers. The authors have released all the weights and how they 
pre-processed the training data, and the network itself is easy to under- 
stand, modify, and use as a starting point for other networks. 


So we can easily create a full version of VGG16 in our own code and use 
it right away to classify images, with no training time. But if we want 
to experiment with the system, or teach it new tricks, we can start with 
the trained model and change it. 


Let’s look at the VGG architecture. As with the MNIST system, most of 
the work is done by a series of convolution layers. Utility layers appear 
along the way, and then there’s some flattening and dense layers at the 
very end. 


Before each convolution, we pad the image with zeros so that we don’t 
lose pixels around the outside. 


The convolution layers (with their zero-padding) come in sequences of 
2 or 3 of the same repeated layer. At the end of each of these sequences 
there’s a max pooling layer with a pooling size of 2 by 2 and a stride of 
2 in each dimension, so the output tensor is cut down by half in both 
width and height. 


Before we feed any data to our model, we must pre-process it in the 
same way that the authors pre-processed their training data. That involves 
making sure the image is sized at 224 by 224, and each channel has 
been adjusted by subtracting a specific value from all of its pixels 
[Simonyan14]. Once that’s done, we’re ready to feed our image to the 
network. 
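

The channel adjustment can be sketched as follows. The three values below are commonly quoted per-channel means for ImageNet-trained VGG models. They are our assumption, since the text doesn't list them. The exact values and the expected channel order should be taken from the documentation of whichever VGG16 weights we download. Resizing to 224 by 224 is left to an image library.

```python
import numpy as np

# Assumed per-channel values (red, green, blue); not taken from the text.
CHANNEL_MEANS = np.array([123.68, 116.779, 103.939])

def subtract_channel_means(image, means=CHANNEL_MEANS):
    """Subtract one value per channel from a (224, 224, 3) image."""
    if image.shape != (224, 224, 3):
        raise ValueError("resize the image to 224 by 224 by 3 first")
    # Broadcasting applies means[0] to channel 0, means[1] to channel 1, and so on.
    return image.astype(np.float64) - means

gray_image = np.full((224, 224, 3), 128, dtype=np.uint8)  # made-up flat gray image
adjusted = subtract_channel_means(gray_image)
print(adjusted.shape)   # (224, 224, 3)
```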


We'll present the VGG16 architecture as a series of 6 blocks. Figure 
21.51 shows the first stage. The input is a color image of 3 channels 
(one each for the red, green, and blue values of each pixel) of size 224 
by 224. That input image goes into two sequential padding-convolu- 
tion steps, and is then reduced by half in each dimension with a max 
pooling step. 


[Diagram: the input, 224 x 224 x 3, passes through two convolution 
layers of 64 x (3x3) filters with ReLU, then 2x2 max pooling with 
stride 2x2.] 


Figure 21.51: The first block of VGG16. The input is zero-padded with a 
single ring of zeros, and then we apply 64 filters each of size 3 by 3. Then 
we zero-pad that result and run another set of 64 filters over it, each 
again with a footprint of 3 by 3. Finally, we use max pooling to reduce the 
size of each of the original image’s dimensions by half. 


We can see in Figure 21.51 that we’ve grouped the initial processing 
into two identical chunks of a padding step followed by convolution. 
After the max pooling step, the output is a tensor of size 112 by 112 by 64. 


This tensor then flows into the next block, shown in Figure 21.52. This 
is just like block one, only we apply 128 filters in each convolution layer. 






[Diagram: two convolution layers of 128 x (3x3) filters with ReLU, then 
max pooling with stride 2x2.] 


Figure 21.52: The second block of VGG16 is just like the first block in 
Figure 21.51, except that we use 128 filters in each convolution layer rather 
than 64. 


Block 3 continues the pattern of doubling the number of filters in each 
convolution layer. But it also repeats the padding-convolution group- 
ing three times instead of 2. Figure 21.53 shows block 3. 


[Diagram: three convolution layers of 256 x (3x3) filters with ReLU, 
then max pooling with stride 2x2.] 


Figure 21.53: Block 3 of VGG16 doubles the number of filters again to 
256, and repeats the padding-convolution step 3 times rather than 2 as 
before. 


Blocks 4 and 5 of the network are the same. Each block is built from 
three pairs of padding and convolution, followed by a max pooling 
layer. The structure of these layers is shown in Figure 21.54. 


[Diagram: three convolution layers of 512 x (3x3) filters with ReLU, 
then max pooling with stride 2x2.] 


Figure 21.54: Block 4 and Block 5 of VGG16 are the same. They each have 
three pairs of padding and convolution, followed by a 2 by 2 max pooling 
step. 


This ends the convolution blocks, and now we come to the wrap-up. 
As with the MNIST network, we first flatten the tensor coming out of 
Block 5. We then run it through two dense layers of 4096 neurons, 
each followed by dropout with an aggressive setting of 50%. Finally, 
the output goes into a dense layer with 1000 neurons, one for each 
category. The results are fed to softmax, which produces our output of 
1000 probabilities, one for each class. Figure 21.55 shows these final steps. 


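[Diagram: dense layer of 4096 neurons with ReLU, dropout of 0.5, dense 
layer of 4096 neurons with ReLU, dropout of 0.5, then a dense layer of 
1000 neurons with softmax. Output: 1000 values.] 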


Figure 21.55: The final steps of processing in VGG16. We flatten the image, 
then run it through two dense layers each using dropout. Then we enter 
a dense layer with 1000 neurons, and pass its output through softmax. 
The result is a list of 1000 probabilities, one for each class the image 
could belong to. 


Figure 21.56 shows the whole architecture in one place. 


[Diagram: the full VGG16 pipeline. The input, 3 x 224 x 224, passes 
through five convolution blocks, each ending in 2x2 max pooling with 
stride 2x2. Two dense layers of 4096 neurons follow, each with ReLU and 
dropout of 0.5, and then the softmax output.] 


Figure 21.56: The VGG16 architecture in one place. 
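

To see how the pieces fit together, here is a small sketch that traces the tensor shapes through the five convolution blocks described above. The names BLOCKS and trace_vgg16 are our own, and the sketch only tracks sizes. It doesn't run any convolutions.

```python
# (number of convolution layers, number of filters) for blocks 1 through 5
BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]

def trace_vgg16(size=224, channels=3):
    """Return the tensor shape leaving each block, plus the flattened length."""
    shapes = []
    for _, filters in BLOCKS:
        # Zero-padded 3 by 3 convolutions keep the width and height unchanged,
        # so only the channel count changes inside a block.
        channels = filters
        # The 2 by 2 max pooling with a stride of 2 halves the width and height.
        size //= 2
        shapes.append((size, size, channels))
    return shapes, size * size * channels

shapes, flat_length = trace_vgg16()
for number, shape in enumerate(shapes, start=1):
    print("block", number, shape)
print("flattened:", flat_length)             # 25088

# 13 convolution layers plus the 3 dense layers at the end.
weight_layers = sum(convs for convs, _ in BLOCKS) + 3
print("computational layers:", weight_layers)  # 16
```

The last block leaves a 7 by 7 by 512 tensor, so the first dense layer receives a list of 25,088 numbers.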


If we were to rebuild VGG16 today, we would probably remove the max 
pooling layers, and instead use 2 by 2 striding in the last convolution 
layer of each repeated chunk. 


In Chapter 20 we saw many examples of using VGG16 on images it 
hadn’t seen before. For fun, Figure 21.57 shows four more pictures 
shot around Seattle on a sunny day. The convnet has never seen these 
images, even during validation, but it does a great job with them. 


Figure 21.57: Four photos shot around Seattle on a sunny day. The convnet 
of Figure 21.56 does a great job of identifying each image. 


21.9.2 Looking at the Filters, Part 1 


VGG16 does a great job at classifying images, thanks in large part to 
the filters that were learned by its convolution layers. It seems like it 
would be instructive to look at the filters and see what they’ve learned. 


The filters themselves are just 3 by 3, which is too small for us to 
make much sense of them. But we can see them indirectly by looking 
at images that trigger each filter. In other words, once we’ve selected a 
filter we want to visualize, we can find a picture that causes that filter 
to output its biggest value. 


We can do this with a little trick that is based on gradient descent, the 
algorithm that we used in Chapter 18 as part of backpropagation. But 
now we'll use gradient ascent to climb up the gradient. The technique 
starts with an image full of random noise. We measure the output of the 
filter we’re interested in, and then we use the gradients in the network 
to modify the pixels in the input image. We don’t touch the weights or 
anything else in the network itself, since it’s not learning anything. But 
we’re using the gradients to tell us how to modify the pixels so that the 
input image stimulates that filter a little bit more. We do this over and 
over until the filter’s output is as large as we can make it [Zeiler13]. 


In a sense, the resulting image is what the filter is “looking for.” 
Because we start with random values, we'll get a different final image 
every time, though they'll all be similar because they’re all based on 
maximizing the same filter. 
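

Here is a toy version of this procedure in NumPy. To keep it self-contained we use a single hand-written 3 by 3 filter followed by ReLU, rather than a filter inside VGG16, and we work out the gradient by hand. The filter values, image size, step size, and number of steps are all our own assumptions. With a real network we would ask the deep learning library for the gradients instead.

```python
import numpy as np

def filter_response(image, kernel):
    """Mean ReLU response of one filter swept over the image (no padding)."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    pre = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            pre[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return np.maximum(pre, 0.0).mean(), pre

def response_gradient(image, kernel):
    """Gradient of the mean ReLU response with respect to every pixel."""
    _, pre = filter_response(image, kernel)
    grad = np.zeros_like(image)
    kh, kw = kernel.shape
    for r in range(pre.shape[0]):
        for c in range(pre.shape[1]):
            if pre[r, c] > 0:              # ReLU passes gradient only where active
                grad[r:r + kh, c:c + kw] += kernel
    return grad / pre.size

rng = np.random.default_rng(0)
image = rng.random((16, 16))               # start from random noise in [0, 1]
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)  # hypothetical edge-finding filter

before, _ = filter_response(image, kernel)
for _ in range(50):
    image += 0.5 * response_gradient(image, kernel)  # climb the gradient
    image = np.clip(image, 0.0, 1.0)                 # keep pixels in range
after, _ = filter_response(image, kernel)
print(before, after)
```

Note that the loop only ever changes the pixels of the image. The filter's weights stay fixed throughout, just as the network's weights do in the technique described above.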


Let’s look at some images produced by this method. Figure 21.58 shows 
pictures produced for the 64 filters in the 2nd convolution layer in the 
first block of VGG16 (we'll use the label block1_conv2 for this layer, 
and similar names for the other layers we'll look at). 


Figure 21.58: Images that get the biggest response from each of the 64 
filters in the block1_conv2 layer of VGG16. 


It seems like a lot of these filters are looking for stripes of different 
widths and orientations, which would be a good way to find edges. A 
few seem to be looking for borders of different types, and a bunch have 
values that are too subtle for us to make much of, though the filter in 
the bottom-right looks like a creepy thing with many eyes. 


Let’s move up to block 3, and look at the first 64 filters from the first 
convolution layer there. Figure 21.59 shows the images that stimulate 
these filters the most. 


Figure 21.59: Images that get the biggest response from the first 64 
filters in the block3_conv1 layer of VGG16. 


Now we're talking! The filters here seem to be looking for more com- 
plex textures, though there are still lots of stripes and stripy patterns. 
Let’s move on even higher and look at the first 64 filters from the first 
convolution layer of block 4, in Figure 21.60. 


Figure 21.60: Images that get the biggest response from the first 64 
filters in the block4_conv1 layer of VGG16. 


These are getting interesting. The filters seem to be hunting for 
patterns that involve a lot of different kinds of flowing and interlocking 
shapes. 


Just for fun, let’s look at close-ups of a few of these filters. Figure 21.61 
shows larger views of 9 patterns from the first few layers. 


Figure 21.61: Close-ups of some manually selected images that triggered 
the largest filter responses from the first few layers of VGG16. 


Figure 21.62 shows patterns that triggered big responses from filters 
in the last few layers. 


Figure 21.62: Close-ups of some manually selected images that triggered 
the largest filter responses from the last few layers of VGG16. 


These patterns are exciting and beautiful. They also have an organic 
feeling about them, probably because VGG16 was trained on a huge 
number of images of animals. 


21.9.3 Looking at the Filters, Part 2 


Another way to look at a filter is to run an image through VGG16, 
and look at the image produced by that filter. That is, we feed an image 
to VGG16 and let it run through the network, but we ignore the 
network’s output and instead extract the output produced by the filter 
we’re interested in, and we draw that. 


Let’s give it a spin. 


Figure 21.63 shows our input image of a duck. We'll use this for all of 
our images in this section. 





Figure 21.63: The duck image that we'll use to visualize filter outputs. 


To get a feeling for things, Figure 21.64 shows the response from the 
very first filter on the very first convolution layer of the network. Since 
the output of a filter has just one channel, the image is no longer in 
color. We gave it a heatmap from black to reds to yellow to show the 
value of each element from 0 to 255. 


Figure 21.64: The response of filter 0 in layer block1_conv1 in VGG16 to 
the duck image in Figure 21.63. 
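

One simple way to turn a filter's raw output into values from 0 to 255 for drawing is min-max scaling, sketched below. The text doesn't say exactly how the values were mapped, so this scaling is our assumption. The array of activations is random stand-in data, not output from VGG16.

```python
import numpy as np

def to_display_range(response):
    """Rescale one filter's output to integers from 0 to 255 for drawing."""
    low, high = response.min(), response.max()
    if high == low:                        # flat response: avoid dividing by zero
        return np.zeros(response.shape, dtype=np.uint8)
    scaled = (response - low) / (high - low)
    return np.round(scaled * 255).astype(np.uint8)

# Hypothetical stack of filter outputs with shape (height, width, filters).
activations = np.random.default_rng(1).normal(size=(224, 224, 64))
first_filter = to_display_range(activations[:, :, 0])
print(first_filter.shape, first_filter.min(), first_filter.max())  # (224, 224) 0 255
```

The resulting single-channel image can then be handed to any plotting tool with a black-to-red-to-yellow color map.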


It looks like this filter is trying to find vertical edges. When they get 
darker from top to bottom or left to right, we get a big response. When 
they get lighter in those directions, we get a very small response. 


Figure 21.65 shows the responses from the first 32 filters in the first 
convolution layer of the first block. 


Figure 21.65: The responses of the first 32 filters in VGG convolution 
layer block1_conv1. 


A lot of these filters seem to be looking for edges, but others seem to be 
looking for particular features of the image. Let’s look at close-ups of 8 
manually selected filters chosen from all 64 of the filters on this layer, 
shown in Figure 21.66. 




















Figure 21.66: Close-ups of 8 manually chosen filter responses from 
VGG16’s first convolution layer, block1_conv1. 


The third image in the top row seems to be looking for the duck’s 
feet, or maybe it’s just interested in bright orange things. The left- 
most image in the bottom row looks like it’s searching for the waves 
and snow behind the duck, though the image to its right appears to be 
responding most to the blue waves. 


Let’s move farther into the network, out to the third block of convolu- 
tion layers. The outputs here are smaller by a factor of 4 on each side 
than those coming out of the first block, because they’ve gone through 
two pooling layers of size 2,2. We’d expect that they will be looking 
for clusters of features, and less directly tied to the duck itself. Figure 
21.67 shows the responses for the first convolution layer in block 3. 





Figure 21.67: The responses of the first 32 filters in VGG convolution 
layer block3_conv1. 


It’s interesting that there still seems to be a lot of edge finding going on. 
It seems that strong edges are an important cue for VGG16 as it works 
to figure out what an image is showing. But there are lots of regions 
that are bright, perhaps where the texture of the image matches one or 
more of the patterns that the filters are looking for. 


Let’s jump all the way to the last block. Figure 21.68 shows the 
responses for the first 32 filters for the first convolution layer in 
block 5. 








Figure 21.68: Filter responses for the first 32 filters in VGG convolution 
layer block5_conv1. 


As we'd expect, these images are even smaller, having passed through 
two more pooling layers that each reduce the size by a factor of 2 on 
each side. At this point the duck is hardly visible, as the system is com- 
bining features from the previous layers. Some of the filters are barely 
responding. They are probably responsible for finding high-level fea- 
tures that aren’t present in the duck image. 


In Chapter 28 we'll look at a couple of creative applications that use 
the filter responses in a convnet to create art. 


21.10 Adversaries 


There’s a surprising thing that we can do to our images that will throw 
off VGG16’s predictions. In fact, this trick will mess up the results of 
any classifier. 


The trick to “fooling” our convnet involves creating a new image called 
an adversary. This image is created from the starting image by 
adding an adversarial perturbation (or more simply, a perturbation), 
which is an image that looks like random noise to us. If the same 
perturbation works for every image we give to a particular classifier, or 
even to every image we give to every classifier, we call it a universal 
perturbation [Moosavi-Dezfooli16]. 


Suppose that when we get ready to hand a picture to our classifier, we 
first add this perturbation to it, pixel by pixel. The changes are so small 
that we can’t see any difference, even when looking at the before and 
after pictures side by side. But the little changes to the pixels are just 
right to cause the classifier to completely mess up and predict what 
appears to be a random category. 
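

To get a feeling for how such tiny changes can add up, here is a toy sketch. It uses a made-up linear classifier with two classes rather than VGG16, and it builds the perturbation by nudging every pixel a small, equal amount in the direction given by the sign of the gradient. That is one well-known idea for building adversaries, and not necessarily the method used for the figures in this section. All of the names and values are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
pixels = rng.uniform(0.0, 255.0, size=224 * 224 * 3)  # stand-in "image"
weights = rng.normal(size=pixels.size)                 # hypothetical classifier

def score(x):
    """Positive score means class A, negative means class B."""
    return float(weights @ x)

original = score(pixels)

# For a linear classifier, the gradient of the score with respect to the
# pixels is just the weights. Moving every pixel by epsilon against the
# sign of that gradient changes the score by epsilon * sum(|weights|).
needed = abs(original) / np.sum(np.abs(weights))
epsilon = 1.01 * needed                    # just enough to cross over
perturbation = -np.sign(original) * epsilon * np.sign(weights)
adversary = pixels + perturbation

print("largest pixel change:", np.max(np.abs(perturbation)))
print("score before:", original, "after:", score(adversary))
```

Because there are so many pixels, a very small change to each one can add up to a large change in the score, which is the heart of why these perturbations can be invisible to us and still change the result.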


For example, on the left of Figure 21.69 we see an image of a tiger. The 
system correctly classifies it as a tiger with about 80% confidence, with 
a little bit of confidence for related animals such as a tiger cat (a small 
forest cat) and a jaguar. 


Figure 21.69: An adversarial attack on an image. Left: At top, a picture 
of a tiger. Below it are the classes predicted by VGG16, with their proba- 
bilities. Middle: An image created by a program that wants to cause the 
tiger to be misclassified. We've expanded the pixel values so they're visu- 
ally readable in the figure, but the values at the top show that the values 
are in the range of about -2 to +2. Right: At top, the result of adding 
the middle image to the tiger in the upper left. Visually, the tiger looks 
unchanged. Below, the results of VGG16 on the image above. It’s not even 
recognized as an animal! 


In the middle of Figure 21.69 we show an image computed by an algo- 
rithm designed to find adversaries. We’ve cranked up the values so we 
can see them better, but the numbers at the top show that the pixels 
are all in the range of about -2 to +2 (the tiger’s pixels are all in the 
range 0 to 255). When we add this seemingly noisy image to the tiger, 
we get the image on the right. To our eyes, it seems unchanged. But 
look at the change in the classes! The system doesn’t even think that 
this is an animal. 


The amazing thing about adversaries is that we can compute them for 
any image. Figure 21.70 shows a perturbation for an image of a power 
drill over a background of wrenches. The range of pixels here is even 
smaller than before, but the outputs have changed entirely. A syringe? 


Figure 21.70: Applying a perturbation to an image of a power drill. 


Let’s look at one more example. In Figure 21.71 before we added the 
perturbation, the system was essentially certain that this was a toucan. 
After adding the perturbation, it was about 40% sure it was a peacock. 
The other possible classes are all lizards. 



Figure 21.71: Applying an adversarial attack to an image of a toucan. 


A variety of algorithms have been developed for creating adversarial 
images [Rauber17a]. The range of values in the perturbations these 
methods create for a given image can vary considerably, so to find the 
smallest perturbation it’s often worth trying a few different methods. 
We can tell these methods, called attacks, what criteria they should 
use in order to measure success. For example, we can ask for a pertur- 
bation that simply causes the input to be misclassified. Another option 
asks for a perturbation that will cause the input to be classified as a 
particular class. For the images above, we asked for a perturbation 
such that none of the original top 7 classes assigned by the classifier 
would show up as the top class for the adversary. Many of these meth- 
ods have been implemented in a Python library that we can use with 
any type of classifier and any image [Rauber17b]. 


Adversarial perturbations have to be carefully constructed. And they 
may be an inevitable weakness of CNNs [Gilmer18]. But their existence 
suggests that convnets still hold surprises for us, and they shouldn’t be 
considered foolproof. There’s more to be learned about what’s going 
on inside of CNNs. 


22.1 Why This Chapter Is Here 


In most of this book we’ve considered every sample as a standalone 
entity, unrelated to any other samples. 


This makes sense for things like photographs. If we’re classifying an 
image and decide that we’re looking at a cat, it doesn’t matter if the 
image before or after this one was a dog, or squirrel, or airplane. The 
images are independent of each other. 


But if an image is a frame of a movie, then it can be helpful to look at it 
in the context of the other images that precede and follow it. That can 
help us understand what’s happening in the frame, and even infer the 
presence of objects that might temporarily be obscured by someone’s 
body. 


When data is arranged so that each piece has some kind of relationship 
to the pieces that come before and after, we call that a sequence. In this 
chapter, we'll look at how we can work with these kinds of sequences 
to derive meaning from them. 


For example, we might have a series of readings of the noontime tem- 
perature of a specific place over a period of days, or the time of high 
tide each day, or the price of a stock at the close of trading. An import- 
ant type of sequence is language. We can think of written or spoken 
language as a stream of letters or words, or larger units like sentences 
and paragraphs. 


Making sense of these sequences is much easier if we can look at the 
entire collection. For instance, if someone says “He told me I could 
have one of these strawberries,” then we need to look at previous pieces 
of the stream of words to work out who “he” refers to. The sentence
“She said to wait over there” requires even more context, or
surrounding information, to be understood.


With access to what’s come before each piece of information (and
maybe even what comes after), we can translate someone’s speech
into another language, ideally conveying the sense of what they’re
saying with a well-constructed expression in the new language, rather
than just converting each word, one at a time, in isolation (although
doing just that, with attitude, can produce enduringly funny results
[Twain03]).


Algorithms that understand and process sequences have another 
bonus: they are frequently capable of generating new sequences. 
Once such a system has been trained, we can generate a poem or a 
story, perhaps even in the voice of a famous writer [Deutsch16a], or
make new scripts for TV sitcoms [Deutsch16b]. Starting with a few
notes, we can generate a whole song. It can be a single melody like an
Irish jig or reel [Sturm17], a polyphonic melody [LISA17], or a
complex song with melody and chords [Johnson17] [O’Brien17]. We can
create lyrics, if we want them [Krishan17]. We can even specialize, for
example creating pop music [Chu17], folk music [Sturm15], rap lyrics
[Barrat17], or country music lyrics [Moocarme17].


Other prominent uses of RNNs are in speech-to-text systems
[Geitgey16] [Graves13], and in generating captions for images and
videos [Karpathy13] [Mao14].


This chapter presents a learning architecture that is explicitly designed 
to learn from sequential data. 


This architecture is called a recurrent neural network or, more
frequently, simply an RNN. It’s also often called an LSTM, because that
specific technique, which we'll see later in detail, has become a widely 
popular choice of implementation. In this chapter we'll see what makes 
an RNN special, how to build one, and how to use it.


22.2 Introduction


We would like our machine learning systems to be able to understand 
sequences of information, rather than the isolated, independent pieces 
of data we’ve seen until now. 


Before we start looking at new stuff, let’s try to extract meaning from 
sequential data using just our familiar dense, or fully-connected, layers. 


Suppose that we have a sequence of temperature measurements over 
the course of a day. We’d like to train the system to take four sequential 
measurements and predict the fifth. Each sample has just one feature, 
describing the temperature, as in Figure 22.1. 


[Figure 22.1 image; axes labeled “features” and “samples”]





Figure 22.1: Our starting data consists of samples that each contain one 
feature, giving us the temperature at a particular time. 


We can pack up the first four samples (let’s just call them “values” for
now) into one big combined sample. The fifth value will be the target
we want the network to predict for this sample, as in Figure 22.2.


[Figure 22.2 image: three columns of values, 1 through 5, 2 through 6,
and 3 through 7; the last value in each column is the target]





Figure 22.2: Combining multiple sequential samples into new larger 
samples. Left: We create a sample from values 1 through 4, and use value 
5 as the target. Middle: Our sample uses values 2 through 5, and uses
value 6 as the target. Right: The sample uses values 3 through 6, and uses
value 7 as the target.


The next sample will have values 2 through 5, and the target is value 
6. Then we combine values 3 through 6, with the target of value 7, and 
so on. Our hope is that the network will learn something about the 
relationships between the values we're giving it that will let it better 
predict the target. 


This is called a windowed dataset, since we’re using a window of 
size 4 to create new, combined inputs for our network. In this example, 
we’re using the common technique of overlapping windows, where
each successive sample contains some of the values used in the previ- 
ous sample. 
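

As a rough illustration, the windowing described above can be sketched
in a few lines of Python with NumPy. The function name make_windows and
the temperature readings are made up for this example; they are
assumptions for illustration, not part of any library.

```python
import numpy as np

def make_windows(values, window_size=4):
    """Split a 1D sequence into overlapping windows and their targets."""
    values = np.asarray(values, dtype=float)
    samples, targets = [], []
    # Each window holds `window_size` consecutive values. The value that
    # immediately follows the window is the target for that window.
    for start in range(len(values) - window_size):
        samples.append(values[start:start + window_size])
        targets.append(values[start + window_size])
    return np.array(samples), np.array(targets)

# Hypothetical temperature readings, invented for this sketch.
temperatures = [61, 63, 66, 70, 73, 75, 74]
X, y = make_windows(temperatures, window_size=4)
print(X.shape, y.shape)  # (3, 4) (3,)
```

Seven values and a window of size 4 give three combined samples, each
sharing three of its values with the sample before it.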


To learn from this dataset we might use a simple network like that of 
Figure 22.3. 


[Figure 22.3 image: three dense layers with 10 neurons (ReLU),
10 neurons (ReLU), and 1 neuron (linear)]


Figure 22.3: A little regression network to learn from our windowed 
dataset, and predict new values from the sequence embedded in each 
sample.


This network won’t be able to learn from the ordering of the values
within each sample.
If we’d chosen to build our windows by collecting the values in some
other order, as in Figure 22.4, this network would ultimately produce 
the same results. 


[Figure 22.4 image: the same three windows as Figure 22.2, with the
values in each window stored in the order 3rd, 1st, 4th, 2nd]


Figure 22.4: Combining our original values into samples, but
consistently storing the values in an order that doesn’t match their
original sequence.


If we think of this in terms of words, we might expect the fragment 
“I put on my hat and” to be followed by the word “gloves,” but there
wouldn’t be any reason to expect that the scrambled fragment “hat on 
my I and put” should also be followed by “gloves.” 


In language, the order of words matters, and that’s true for most 
sequential data. But our little network of dense layers will work just as 
well whether our values are in the order of Figure 22.2 or Figure 22.4.
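

One way to see why is to note that a dense layer only computes weighted
sums of its inputs. The sketch below uses NumPy and random, made-up
weights (an assumption for illustration, not a trained network) to show
that if the inputs are scrambled the same way every time, a matching
rearrangement of the weights gives exactly the same output. Training on
scrambled windows can simply find those rearranged weights.

```python
import numpy as np

rng = np.random.default_rng(0)

window = np.array([1.0, 2.0, 3.0, 4.0])   # values in their original order
weights = rng.normal(size=(4, 10))        # made-up weights: 4 inputs, 10 neurons
bias = rng.normal(size=10)

def dense_relu(x, w, b):
    """One fully-connected layer followed by ReLU."""
    return np.maximum(0.0, x @ w + b)

# Scramble the window in one fixed way, in the spirit of Figure 22.4.
order = [2, 0, 3, 1]
scrambled_window = window[order]

# Rearranging the rows of the weight matrix the same way undoes the
# scrambling, so the layer's output is unchanged.
scrambled_weights = weights[order, :]

original_out = dense_relu(window, weights, bias)
scrambled_out = dense_relu(scrambled_window, scrambled_weights, bias)
print(np.allclose(original_out, scrambled_out))  # True
```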


There has been some exciting work on using CNNs to handle sequence 
data [Chen17b][vandenOord16], but those tools are still developing. 


So let’s instead look at a tool specifically designed to handle sequences. 
This tool does its work efficiently and elegantly, and powers almost any 
approach today that involves sequences of any kind in the input,
output, or both.


22.3 State


We will want to create a computational unit that is capable of remem- 
bering some internal information over time. We'll call the information 
the unit’s state. 


To retain this state, we'll introduce a new type of component, illus- 
trated in Figure 22.5. 


[Figure 22.5 image; labels: Input, RNN, Internal Memory, Output]





Figure 22.5: An RNN uses internal memory to process its input into 
output. The internal memory is able to change with every input, even 
after training is complete. 


It’s mostly just a bunch of neurons and activation functions, but they’re
wired up in a specific way and treated as a single unit. There’s also a bit
of memory in there, which we'll get to below. We just pop this unit into
our network as a layer like any other. As we'll see, the two most popular
versions of this recurrent unit go by the acronyms LSTM and GRU, and
they’re what we use to build a recurrent neural network, or RNN. We'll
look at them more closely below.


What makes these units special is that they save information. This 
means each time they process an input, they can use some of what 
they’ve learned from previous inputs.


RNNs can be thought of as roughly inspired by real biology, or just
as pieces of computational machinery. We’ll be taking the latter view, 
but some presentations focus on biologically plausible motivations for 
their structure. A discussion of these differences, plus a history of the 
development of RNNs, is offered by Lipton et al. [Lipton15]. 


In the networks we’ve seen before, once training is over, the weights 
are frozen. This means that the network is done learning and done 
changing. We feed in values, and it simply applies the operations that 
make up the network, using the values it’s learned. With the exception 
of random numbers used by some algorithms, it will process any par- 
ticular input exactly the same way every time. 


But our new units are able to remember new information after train- 
ing has completed. That is, they’re able to keep changing after training 
is over.
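

The difference between frozen weights and a memory that keeps changing
can be sketched with a toy Python class. Everything here (the class
name ToyUnit, the single weight, and the rule for updating the memory)
is a hypothetical stand-in chosen for illustration, not the structure
of a real recurrent unit.

```python
class ToyUnit:
    """A toy unit with one frozen weight and one remembered number."""

    def __init__(self, weight):
        self.weight = weight   # fixed once "training" is over
        self.memory = 0.0      # keeps changing while the unit is in use

    def process(self, x):
        output = self.weight * x + self.memory
        self.memory = output   # remember something for the next input
        return output

unit = ToyUnit(weight=0.5)
first = unit.process(1.0)
second = unit.process(1.0)     # the same input, presented a second time
print(first, second)           # 0.5 1.0
print(unit.weight)             # 0.5 (the weight never changed)
```

The same input produces two different outputs, even though the weight
is untouched, because the remembered value changed in between.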


That’s essential if we’re going to learn how to translate a sentence. For 
example, if the sentence is, “Mary said her shoes are too tight,” we need 
to somehow remember that Mary is the subject, or we won’t be able to 
make sense of the word “her.” A later sentence might be, “Alice put 
her keys on the table.” Now the word “her” refers to Alice. This kind of 
remembering has to be done while evaluating the input, since we can’t 
predict what “her” will mean in any given sentence when we're train- 
ing. This is what makes our new structure special. 


The beauty of the system is that, like so many other operations our
neural networks perform, the network itself will learn what it needs to 
remember and how. We just have to set up the structure to enable it to 
do the job. 


The key idea here is the memory that we referred to above. It’s important
to distinguish this kind of memory from the weights in the network.
We could certainly say that the weights are a kind of memory, since
they’re saved with the network and persist. But they don’t change with
the input. Once learning is done, the weights are fixed until we go back
to learning again. The kind of memory we're talking about is different
because it changes when the network is deployed and in actual use.


The information we save and read in this memory is collectively
called the state. Sometimes the word is used in larger phrases, like 
“the state of the system” or “the state of the computation.” The word 
“state” is used here as a rough synonym for “situation” or “configura- 
tion,” but with the caveat that we’re referring to the remembered data 
that describes that situation or configuration. We say that we “save the 
state” and then “read the state.” 


Conceptually, the state can be written to a file, saved on a USB stick, or 
transmitted over a network. In RNNs, we keep it inside the recurrent 
unit. 


22.3.1 Using State 


Let’s consider an example of a system that reads and writes its state 
over time, the way our RNNs will. 


Imagine a robot that fixes cell phones. The robot sits at a counter with 
a parts cabinet next to it. If we place a broken cell phone in front of the 
robot, it will analyze the phone to work out what’s wrong, and then use 
its tools and the parts in the cabinet to fix it. This means it will prob- 
ably take some parts out of the parts supply. It might even add some 
parts back in. For instance, if it can replace a complicated bit of the 
phone with something simpler, then any pieces that were removed but 
are still functional could be added to the robot’s supply of parts. 


The robot’s parts supply is its state, or persistent information. When 
it starts each repair, it considers what parts are available in the state to 
help decide how to fix the phone. When the repair is done, some parts 
have probably been removed, and others might have been added, pro- 
ducing a new version of the state. So this one supply of parts changes 
after each repair, and becomes the starting supply of parts for the next. 


We might draw this as in Figure 22.6.


[Figure 22.6 image, panels (a) and (b); labels: Broken Phone,
Repaired Phone]


Figure 22.6: Our robot that fixes cell phones with the help of a box of 
parts. (a) The robot is given a broken phone and a box of parts. (b) The 
robot returns to us the repaired phone, and it updates the contents of 
the parts box to match the parts that were removed and others that may 
have been added. Note that communications with the parts are always 
with the same supply, so they’re shown with a different style of arrow. 


This figure introduces an important convention that we'll be using 
in future diagrams. When we’re showing the flow of information (in 
this case, information about the phone), we use a single line with an 
arrowhead. But operations that involve the state are a fundamentally 
different idea, because there’s just one state that is getting used and 
then updated over time. These operations are shown with an outlined 
arrow. This distinction will help to make later diagrams clearer. 


We kept the robot and the parts in Figure 22.6 in the same place in 
both parts of the figure so it was clear that those elements were the 
same in both versions. It will be helpful to juggle the pieces of this fig- 
ure around so that arrows always point right and up in the picture, as 
in Figure 22.7.


[Figure 22.7 image, panels (a) and (b); labels: Parts, Broken Phone,
Repaired Phone]





Figure 22.7: A cosmetic change to Figure 22.6 so that all arrows point 
either up or to the right. 


We still have just one parts supply. We’ve merely moved the pieces 
around so that the arrows flow left-right and bottom-up. 


Now let’s compress the two pieces of Figure 22.7 into just one diagram, 
in Figure 22.8. The important things to keep in mind are that this pic- 
ture represents two sequential steps, and the two boxes labeled “Parts” 
both refer to the same, single box. The outlined arrow is designed to
remind us of this.


[Figure 22.8 image; labels: Broken Phone, Repaired Phone]


Figure 22.8: If we mush together the two steps in Figure 22.7 we can 
make a tighter diagram. Remember though that this represents two steps 
in time, and only one box of parts.


Let’s put our robot to work on a series of three different broken phones.
We can just line up multiple copies of Figure 22.8 so that each repair 
uses the parts box in the condition left by the previous repair. Figure 
22.9 shows the sequence. 


[Figure 22.9 image: three repairs in sequence; Broken Phone 0, 1, and 2
go in, and Repaired Phone 0, 1, and 2 come out]


Figure 22.9: Repairing multiple phones using the diagram of Figure 22.8. 
There’s only one robot and only one box of parts, but here we can see that 
in each sequential repair, the robot uses the parts box in the condition it 
was in at the end of the previous repair. 


It’s crucial to see that Figure 22.9 shows just one robot and just one 
set of parts. We’re showing multiple moments of time, like a multi- 
ple-exposure snapshot. Reading from the left, we start with the initial 
box of parts and Broken Phone 0. The robot fixes the phone,
returning to us Repaired Phone 0, with a changed box of parts. Then some
time later, Broken Phone 1 arrives. So the same box of parts that was 
just updated is now used by the same robot, to create Repaired Phone 
1 and an updated box of parts. The cycle continues for all new broken 
phones. 


We've basically described how an RNN unit works. Let’s express the 
process we’ve just seen using the language of RNNs. Note that when 
there’s no chance of confusion between individual RNN units and 
RNN-based networks, we sometimes refer to an RNN unit as an RNN.


Our robot will be replaced by a single RNN unit, also called an RNN
cell. As we mentioned above, this unit is really a little bundle of some 
artificial neurons and memory. It accepts an input and provides an 
output, but most importantly, it has the ability to read and write to its 
persistent internal memory (like the parts box). 


This persistent memory is the unit’s state. In an RNN, the state is usu- 
ally just a list of floating-point numbers. We specify the size of this list 
when we create the unit. We can use lengths of 3 or 5 in small projects, 
or lengths of several hundred in larger systems. As with other neural 
network parameters (like the number of neurons in a dense layer), this 
is a number we pick based on intuition, experience, and usually some 
experimentation. 


We can re-draw our repeating repair diagram of Figure 22.9 with this 
new language in Figure 22.10. This is the basic structure of a simple 
RNN. 


[Figure 22.10 image: three steps in sequence; input0, input1, and
input2 go in, and output0, output1, and output2 come out]


Figure 22.10: A simple RNN. There is only one RNN Unit, and only one 
State, but they are re-used by each sequential input, producing sequen- 
tial outputs. 


As we’ve mentioned, the state information is stored inside the RNN 
unit itself. That lets us make our diagrams a little simpler, because if 
we assume that the state is inside the RNN, we don’t have to show the 
state explicitly. But it can make our diagrams a little more mysterious, 
because we have to remember that the state is present, and “inside” 
the symbol for the RNN unit, and is being updated after each input as 
in Figure 22.10.


Another important aspect of Figure 22.10 is that the state is updated
when processing is done. That is, when an input arrives, it is processed, 
and the output is presented, and only then is the new state made avail- 
able for processing the next input. 


22.4 Structure of an RNN Cell 


Let’s make a single RNN unit, or cell. It will take input and produce 
output, and it will have some internal state that it maintains along the 
way. 


Figure 22.11 shows a first stab at this. Let’s assume that our input,
output, and state are all single, real numbers. As we did before, we’ve
broken down the operation of the unit into two steps. Figure 22.11(a)
shows step 1, where the input and the current state are combined to
produce a new value. Since we haven't talked about just how that
combination happens, we’re representing the operation as an empty circle
for now. Figure 22.11(b) shows step 2, where this new value is written
back to the state, and presented to the world outside this unit as its
output.


[Figure 22.11 image, panels (a) and (b); labels: input, output]


Figure 22.11: A first shot at building an RNN unit. (a) In the first step, an 
input and the current state are combined (for now, we're not saying just 
how). (b) In the second step, the result of part (a) is written to the state 
memory, and sent to the output.


We can combine the two steps of Figure 22.11 into one diagram, as we
did before. To reinforce the idea that we’re dealing with two different
moments in time, we draw a small black box after the memory that
holds the state, as in Figure 22.12. This box represents a delay step. 


[Figure 22.12 image; labels: input, output]


Figure 22.12: Combining the two steps of Figure 22.11 into a single diagram. 
The black box represents a single step of time delay. 


One way to interpret the delay step is that it’s a little buffer, or piece of 
memory, that holds a value temporarily. As we saw in Figure 22.11(a), 
when a new input arrives at the cell we want to combine it with the 
value of the state that was written at the end of processing of the previ- 
ous input. 


That’s just what the buffer gives us. In other words, the buffer “hangs 
on” to the value of the state at the end of step 2 from the previous 
cycle. When an input arrives, it always gets combined with the value 
of the state that was computed for the previous input. Only when pro- 
cessing is completely finished for this input, and the output has been 
presented to the world outside the unit, does the delay step update 
itself to the new version of the state. The delay lets us combine the two 
halves of Figure 22.11 without ambiguity, even though they happen at 
different times.


We'd like to finally fill in the missing operation in Figure 22.12 where
the input and state are combined. But we’d also like to generalize this 
diagram with some values that we can control, so that we can tune 
the calculations to produce output that will be useful to us. These two 
steps influence each other. Let’s address the generalization step first. 


To give us some control over our little computation, in Figure 22.13 
we've introduced three weights. In this diagram, these are just real 
numbers, like any other weight in a neural network. Now we can adjust 
those weights, and thereby adjust the calculation and the output. 


[Figure 22.13 image; labels: input, output]


Figure 22.13: At three places in our diagram we scale the value passing 
through by a weight. There will usually be an activation function on the 
output. We're leaving it off for now for simplicity. 


Now we can fill in the missing operation. We can use almost any oper- 
ation we want for combining the weighted input and the output of 
the delay step. But since we control the weights, we’re already able to 
manipulate the values we want to combine, so the combination step 
can be simple. So let’s choose to just add the values together, as shown 
in Figure 22.14.


[Figure 22.14 image; labels: input, output]


Figure 22.14: We can combine the state information coming from the 
delay just by adding the values together. 
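

A single step of this bare-bones unit can be sketched in Python. The
text doesn't spell out exactly where the three weights sit, so this
sketch assumes one weight on the input, one on the value coming out of
the delay step, and one on the output. The function name rnn_step and
the weight names are hypothetical, and the activation function is left
off, as in Figure 22.13.

```python
def rnn_step(x, prev_state, w_in, w_state, w_out):
    """One step of a scalar RNN unit (assumed weight placement).

    Returns the output and the new state to hold in the delay step.
    """
    # Combine the weighted input with the weighted, delayed state
    # simply by adding them together.
    combined = w_in * x + w_state * prev_state
    new_state = combined          # written back to the state memory
    output = w_out * combined     # presented to the outside world
    return output, new_state

output, state = rnn_step(x=2.0, prev_state=1.0,
                         w_in=0.5, w_state=0.25, w_out=2.0)
print(output, state)  # 2.5 1.25
```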


We’ve now created a bare-bones RNN unit! Since Figure 22.14 is a lot 
to draw, we can use the simpler icon shown in Figure 22.15. Note
that this icon is just a single cell, and not a layer. We'll see the layer
symbol below when we build RNN layers and place them into our
neural networks.


Figure 22.15: Our icon for a single RNN cell.


The symbol in Figure 22.15 stands for an entire sequence of
applications of the cell using a sequence of inputs, such as the sequence of
phone repairs that we saw in Figure 22.10. This takes some getting 
used to. What looks like a single box with an input and an output is 
standing for a single box that is given a sequence of inputs, processes 
each element of that sequence one after the other, and produces a 
sequence of outputs. 


Figure 22.16 shows this more expanded view. It’s just a condensed ver- 
sion of Figure 22.10. In Figure 22.16(a) we show that the state begins 
with some initial value, and the RNN unit receives its first input. It 
produces an output, the state is updated, and then the same RNN unit 
receives a second input. Recall that the outlined arrow is meant to 
remind us that we’re looking at a change over time to a single RNN 
unit, and not a flow of information from one object to another. This is 
often called an unrolled diagram, since we’re showing multiple steps 
explicitly. 


[Figure 22.16 image: the RNN unit drawn at time step 0, time step 1,
and time step 2, starting from an initial state]


Figure 22.16: An unrolled RNN diagram. Because the state information 
is contained inside the RNN unit itself, we don’t need to explicitly show 
it. The RNN unit accepts an input, uses it to create an output, and then 
updates its internal state. Later, another input comes along, and the 
process repeats. Each vertical slice of the diagram represents a
sequential moment in time.


The blue boxes in Figure 22.16 are meant to suggest that each operation of
input, processing, output, and updating of the state is a unique event. 
The diagram shows multiple such events, with the open arrow show- 
ing us that the state of the RNN unit is changing after each input is 
processed. 


When each of these inputs is made up of sequential values from a sin- 
gle feature of a sample, we call them time steps. 


The unrolled diagrams of Figure 22.16 represent the same operation 
as the rolled-up (or just rolled) diagram of Figure 22.15. In the 
rolled-up version we’re implying that there will be a sequence of inputs, 
and in response to each one the RNN will produce an output, update 
its state, and then wait for another input. In the unrolled diagram of 
Figure 22.16, we show this sequence of events explicitly. 
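

In code, the rolled-up view corresponds roughly to a loop over the
inputs, and the unrolled view to writing out each step by hand. The
sketch below uses a deliberately simple, hypothetical step function
(two made-up weights, with the output equal to the new state) just to
show that the two views compute the same thing.

```python
def step(x, state, w_in=0.5, w_state=0.5):
    """A simplified scalar step: returns (output, new_state)."""
    new_state = w_in * x + w_state * state
    return new_state, new_state

def run_rolled(inputs, initial_state=0.0):
    """Rolled-up view: one unit, applied to each input in turn."""
    state = initial_state
    outputs = []
    for x in inputs:
        output, state = step(x, state)   # the state carries over
        outputs.append(output)
    return outputs

inputs = [1.0, 2.0, 3.0]
rolled_outputs = run_rolled(inputs)

# Unrolled view: the same three time steps, written out explicitly.
output0, state0 = step(inputs[0], 0.0)
output1, state1 = step(inputs[1], state0)
output2, state2 = step(inputs[2], state1)
unrolled_outputs = [output0, output1, output2]

print(rolled_outputs)                      # [0.5, 1.25, 2.125]
print(rolled_outputs == unrolled_outputs)  # True
```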


22.4.1 A Cell with More State 


Our diagrams of RNN cells have so far processed only a single number 
at a time, and held only one number for the state. In this section we'll 
generalize that. 


Why would we want to have larger inputs? 


Suppose we're going to create a chatbot. It will read as input a sequence 
of words that someone typed in, and it will produce an output consist- 
ing of a sequence of words. 


Let’s suppose that our chatbot has a vocabulary of 8,000 words. We 
saw in Chapter 12 that one way to encode categorical data is to use 
one-hot encoding, where we use a long list of 0’s with a single 1 in 
the slot corresponding to the number we want to encode. It turns out 
this is a great representation for input to an RNN. For our vocabulary
of 8,000 words, the one-hot representation of each word would be a
list of 8,000 entries, all of them 0 except for a single 1 somewhere.
Other types of data can have different numbers in every entry in the list.
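

As a small sketch of this encoding, here is one-hot encoding written
with NumPy. The four-word vocabulary and the helper name one_hot are
invented for the example; a real chatbot would build its word list from
its training text.

```python
import numpy as np

def one_hot(index, vocabulary_size):
    """Return a list of zeros with a single 1 in slot `index`."""
    vector = np.zeros(vocabulary_size)
    vector[index] = 1.0
    return vector

# A tiny, made-up vocabulary standing in for the 8,000 words.
vocabulary = ["hat", "gloves", "scarf", "coat"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

encoded = one_hot(word_to_index["gloves"], len(vocabulary))
print(encoded)  # [0. 1. 0. 0.]

# The same helper works for the full-size vocabulary.
big = one_hot(123, 8000)
print(big.shape, big.sum())  # (8000,) 1.0
```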


We also want to store more than a single number in the state. We
might want a state that can hold 3, or 10, or even a few hundred num- 
bers. If someone is describing a person to our chatbot, this would let 
us remember all sorts of things about that person. For example, we 
might remember the person’s name, their gender, the color of their 
eyes, what they’re doing at the moment, and so on. 


Our generalized RNN cell will use sets of neurons in place of the mul- 
tiple steps in Figure 22.14. Most of these neurons will have linear 
activation functions (the same as having no activation function at all). 


The traditional way to draw a set of inputs going into a collection of 
neurons is shown in the left of Figure 22.17 (remember that there’s an 
implied bias term that we’re not showing). That’s a complicated mess 
of lines. To make our RNN diagram easier to read, we'll draw the same 
thing using the diagram on the right of Figure 22.17. 


[Figure 22.17 image: an input feeding a row of 5 neurons, drawn two ways]


Figure 22.17: Left: The traditional way to show a set of 3 inputs going into 
5 neurons. Right: A simplified version. 


The number of inputs, the number of outputs, and the size of the inter- 
nal state can all be different. So let’s say that our input will have 3 
values (perhaps it’s a tiny one-hot encoded version of a number from 0 
to 2), and the output will be a list of 4 numbers. The internal state will 
have 5 values, because we want to remember 5 different qualities of 
the inputs over time. Figure 22.18 shows how we could assemble our
RNN unit. We start by sending our input to 5 neurons named A1 to A5.
Each of these multiplies the 3 values in the input with its own set of 3 
weights, to produce an output value. The result is a list of 5 numbers. 


[Figure 22.18 image; labels: state, delay]





Figure 22.18: A generalized RNN unit that accepts input of 3 values and 
stores 5 values of state. An input of 3 values is processed simultane- 
ously by five neurons lettered A to create a list of 5 values. This is added, 
element by element, to the state. This result then goes through five 
neurons lettered B to create a new state, which then goes into the delay 
step. The result of the addition also goes into the 4 neurons lettered C to
produce an output.


The outputs of set A are collected into a list of 5 values. This list has the
same shape as our state, so we can add the previous value of the state
to it element by element. The result of this goes to two sets of neurons.
One set, marked B, contains five neurons. Those outputs are collected
into a list and go into the delay step, so they will be combined with the
outputs of the A neurons on the next step. The same sum (the current
outputs of set A plus the delayed outputs of set B) also goes to the 4
neurons marked C, which produce the 4 values of our output.


These three sets of neurons generalize the single weights in our simple 
diagram of Figure 22.14. 


Together, they give us a lot of control over what this unit does. Each 
of the A neurons has 3 weights and there are 5 of them, for a total of 
3x5=15 weights. The B neurons have 5 weights each and there are 5 
of those as well, for 5x5=25 weights. There are 5 bias weights that are 
applied after the addition of the outputs of set A and the delayed
output of set B, giving us 5 more weights. Finally, the C neurons each have
6 weights (5 for the state, and one for the bias), and there are 4 of them,
for 6x4=24 weights. That’s a total of 15+25+5+24=69 weights. That 
gives us a lot of values to adjust during training to get useful results 
from this unit. 
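

To make the bookkeeping concrete, here is a sketch of this unit in
NumPy. The weights are random placeholders rather than trained values,
every neuron is assumed to use a linear activation, and the state is
assumed to start at all zeros. The names (weights_a, cell_step, and so
on) are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_state, n_outputs = 3, 5, 4

# Random placeholder weights; a real unit would learn these in training.
weights_a = rng.normal(size=(n_inputs, n_state))    # A neurons: 3 x 5 = 15
state_bias = rng.normal(size=n_state)               # biases after the sum: 5
weights_b = rng.normal(size=(n_state, n_state))     # B neurons: 5 x 5 = 25
weights_c = rng.normal(size=(n_state, n_outputs))   # C neurons: 5 x 4 = 20
bias_c = rng.normal(size=n_outputs)                 # C biases: 4

def cell_step(x, delayed):
    """Process one input; return (output, value for the delay step)."""
    # A outputs, plus the delayed B outputs, plus the 5 biases.
    summed = x @ weights_a + delayed + state_bias
    new_delayed = summed @ weights_b        # B neurons feed the delay step
    output = summed @ weights_c + bias_c    # C neurons produce the output
    return output, new_delayed

total_weights = sum(w.size for w in
                    (weights_a, state_bias, weights_b, weights_c, bias_c))
print(total_weights)  # 69

delayed = np.zeros(n_state)       # assumed initial state
for x in np.eye(n_inputs):        # three one-hot inputs, one per time step
    output, delayed = cell_step(x, delayed)
print(output.shape, delayed.shape)  # (4,) (5,)
```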


22.4.2 Interpreting the State Values 


In all of these examples, the RNN manages the state for us automati- 
cally, reading and writing values that help it ultimately produce results 
that will be useful to us, such as a chatbot’s response to someone’s 
question. 


It’s natural to ask what these numbers in the state represent. What, 
exactly, is being remembered and forgotten? We earlier suggested
possibilities like a person’s name and gender, but how can the RNN figure
out that this is what it should save, and how does it work out how to
represent that information?


This is like asking what the weights represent in the other types of
neural networks that we’ve seen in the last few chapters. The weights 
represent something, but just what that “something” is depends on 
what the network learned when it was trained. During training, the 
network itself worked out what needed to be saved, and how.


In the same way, the RNN cell “decides” what the state’s contents 
should mean, and how to manage those numbers, so that the network 
as a whole ultimately produces the answers we're asking it for. Most of 
the time we don’t dig too hard to interpret these values, just as we don’t 
stress out over what the individual weights mean in dense or convolu- 
tion layers. But when we understand our problem well, we can try to 
reverse engineer the meanings of the numbers in the state. That can 
help us work out what parts of the input the network is paying
attention to, which can help in understanding its outputs and debugging it
when things go wrong [Karpathy15a]. 


22.5 Organizing Inputs 


The input to an RNN unit is, as usual, made up of samples. Each sam- 
ple is composed of features. The new wrinkle is that each feature is 
made up of time steps. Time steps are values of the feature measured 
at different moments in time. 


The label “time steps” suggests that our measurements are time based. 
There are lots of other ways that sequences of values might come about. 
For example, they might represent multiple circumferences of a tree 
trunk as we work our way from the base to the crown, or the number of
books on each shelf of a large library as we work our way through the 
stacks. But we'll stick with the idea of multiple measurements in time, 
since it’s a common scenario and the viewpoint that the language best 
fits.


The introduction of time steps presents us with a structuring problem.
Until now we’ve had samples that contained features, which we were 
able to conveniently represent as a 2D list. Now we also have time 
steps, giving us a 3D volume. And there are lots of choices for how 
to organize that volume. Our choice for that organization makes a big 
difference in how the network interprets the numbers that are inside. 


The problem comes down to how we want to think about our data. An 
RNN “knows” that we have samples that contain features, and each 
feature contains a sequence of time steps, but we have choices about 
how we want to think about this data. And different interpretations 
can give us different results. 


When we draw pictures of data organizations below, we'll use the 
convention of assigning the directions (away, down, right) to (sam- 
ples, time steps, features). This matches the Keras library we'll see in 
Chapter 23. Other libraries may arrange their data in different orders, 
so it always pays to check the documentation to be sure that we’re 
structuring our data the way that our library expects. We'll stick with 
the Keras convention in this chapter. 


Let’s suppose we're interested in the temperature at the top of a moun- 
tain. Over the course of a day we take 8 hourly measurements of the 
temperature. We want to train an RNN to predict future measure- 
ments from this tiny data set. 


The three ways to organize this data, using the convention we just dis- 
cussed, are shown in Figure 22.19. Only one of these makes sense for 
our data. We show all three so we can better understand how the orga- 
nization of our data communicates how it should be interpreted. 


[Figure 22.19 panel labels: (a) 1 sample, 8 time steps, 1 feature; (b) 1 sample, 1 time step, 8 features; (c) 8 samples, 1 time step, 1 feature]


Figure 22.19: Three ways to structure our 8 measured values of weather 
data. Two of these structures say the wrong thing, according to the 
conventions we’re using. Here we assign the directions (away, down, right) 
to (samples, time steps, features). (a) We have one sample, which contains 
one feature. That feature contains 8 sequential measurements, or time 
steps. (b) We have just one sample, composed of 8 features. Each feature 
is made up of a single time step. (c) Every measurement is a sample 
composed of 1 feature, for which we have one sequential value, or time 
step. We want the organization in part (a). 


When we bundle up our data, we want to use the organization of Figure 
22.19(a). This structure says that we have one day’s worth of data (the 
sample), that day’s data contains the temperature (the feature), and 
we have multiple measurements of that temperature (the time steps). 
This matches our conceptual organization of the data. 


What if we gave our RNN data organized as in Figure 22.19(b)? That 
would be saying to the RNN (and ourselves) that we have 1 sample 
that contains 8 features, or 8 different types of measurements, such 
as temperature, wind speed, humidity, and so on. Each new feature, 
when it arrives, will be interpreted as containing a new list of time 
steps, which in our case will be a single value. Not only do we not have 
8 features, but the RNN won’t learn much, since our sequences are 
just 1 value long. After all, each feature has only 1 piece of data, and 
they’re assumed to represent entirely different types of measurements. 


Figure 22.19(c) is even worse. Now we have 8 samples, or 8 completely
separate collections of measurements. Each sample has 1 feature. And
when the RNN looks at the values for that feature, it finds 1 time step.
A sequence of only 1 element isn’t enough to learn a pattern from.
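
The three layouts of Figure 22.19 can be sketched in a few lines. This is only an illustration, assuming NumPy and the (samples, time steps, features) ordering used in this chapter; the temperature values are made up.

```python
import numpy as np

# Eight hourly temperature readings (made-up values for illustration).
readings = np.array([3.1, 3.4, 4.0, 5.2, 6.1, 6.0, 5.3, 4.2])

# Axis order assumed throughout: (samples, time steps, features).
layout_a = readings.reshape(1, 8, 1)  # 1 sample, 8 time steps, 1 feature
layout_b = readings.reshape(1, 1, 8)  # 1 sample, 1 time step, 8 features
layout_c = readings.reshape(8, 1, 1)  # 8 samples, 1 time step, 1 feature

for name, block in [("a", layout_a), ("b", layout_b), ("c", layout_c)]:
    print(name, block.shape)
```

All three blocks hold the same 8 numbers. Only the shape differs, and the shape is what tells the network how to interpret them.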


Now that we can structure our data, let’s broaden our horizons a bit, 
and suppose that we’ve measured multiple parameters at every read- 
ing. Let’s say we have three values, one each for temperature, humidity, 
and wind speed. That gives us 24 measurements, 8 each for our 3 types 
of data. So we can package that up as a single sample with 3 features, 
each with 8 measurements, as in Figure 22.20. 

















[Figure 22.20 labels: 1 sample, 8 time steps]


Figure 22.20: Organizing our data for 3 features, such as temperature,
humidity, and wind speed. Each feature is made up of 8 time steps. 


Now we go out the next day and repeat our measurements, collecting 8
new readings of 3 values each. And we do that again, and again,
every day for a week. Each day’s measurements form 8 consecutive
time steps, but the measurements from one day to the next don’t
form a single continuous sequence, because they’re broken up by the
16 hours we didn’t measure each day. So they’re independent (though
related) sequences. We can pack each one up in its own sample, as in
Figure 22.21.


Figure 22.21: If we collect our 8 measurements of 3 types of data on 7
consecutive days, we can arrange our data as 7 samples, each composed 
of 3 features, with 8 time steps per feature. 


The 3D volume of Figure 22.21 has dimensions 7 by 8 by 3. We can 
arrange those three numbers in a total of six ways, making six differ- 
ent block shapes. If we use one of the five shapes not shown in Figure 
22.21, the RNN will run, but it will not use our intended interpretation 
of the data, and will usually deliver disappointing results. 


Getting the organization right takes some thinking about our data and 
how we want the system to interpret it, combined with the expecta- 
tions of the library we’re using. 
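
As a rough sketch of the week of weather data, assuming NumPy and the (samples, time steps, features) ordering, we can stack 7 daily blocks into one volume and list the six possible orderings of its dimensions. The measurements here are random placeholders, not real data.

```python
from itertools import permutations

import numpy as np

rng = np.random.default_rng(0)

# One day of data: 8 time steps (rows) by 3 features (columns):
# temperature, humidity, wind speed. Values are random placeholders.
days = [rng.normal(size=(8, 3)) for _ in range(7)]

# Stack the days along a new leading axis: (samples, time steps, features).
data = np.stack(days, axis=0)
print(data.shape)  # (7, 8, 3)

# The temperature sequence (feature 0) for the third day (sample 2).
temps_day3 = data[2, :, 0]

# All six orderings of the three sizes; only the first matches our intent.
shapes = list(permutations(data.shape))
for shape in shapes:
    print(shape)
```

Pulling out one day's temperature sequence, as in `temps_day3`, is a quick way to confirm that the axes mean what we think they mean.
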


22.6 Training an RNN


In Chapter 18 we covered the important algorithm of backpropaga- 
tion, which efficiently computes the error gradients for the weights in 
our network. We typically then apply an update step to modify each 
weight according to its gradient. 


It seems reasonable to presume that this will work for RNNs as well. 
After all, we said above that an RNN is just a packaged-up cluster of 
small artificial neurons and other pieces. Since they learn by backprop, 
it stands to reason that the whole unit should learn by backprop, too. 


In fact, we can do just that, though it’s not problem-free. Let’s start by 
thinking about applying backprop to an RNN. The usual approach is to 
“unroll” the RNN unit first, so we can more easily see what we're deal- 
ing with. When we create our RNN we tell it how many time steps we'll 
be providing (typically, every feature must have the same number of 
time steps, so that the tensor is a complete block of numbers). If we set 
up our RNN unit to have, say, 5 time steps, we can unroll the network 
diagram of Figure 22.22(a) in the form of Figure 22.22(b), explicitly 
showing the processing of each time step, and the open arrow showing
that the state is updated after each step.





[Figure 22.22 labels: (a) input, output; (b) initial, input0 to input4, output0 to output4]


Figure 22.22: An RNN unit can be drawn in “rolled-up” or “unrolled” form. 
(a) Rolled-up form. It’s implied that the unit will step through the time 
steps one by one. (b) Unrolled form, showing the explicit processing of 
each time step. 


The unrolled version in Figure 22.22(b) is something we can apply backprop
to, just like any neural network, by pushing the error gradients backwards.
The only difference is that we need to keep in mind that every
instance of the RNN represents the same unit with the same internal
weights.


This modified version of backprop is called backpropagation 
through time, or BPTT. 
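
To make the unrolling concrete, here is a minimal sketch of the forward pass only, assuming a very simple RNN unit made of a single tanh layer. The names `W_in`, `W_state`, and `rnn_step` are hypothetical, and the weights are random. The point is that the same weights are reused at every one of the 5 time steps.

```python
import numpy as np

rng = np.random.default_rng(1)
n_features, n_state, n_steps = 3, 4, 5

# One set of weights, shared by every time step of the unrolled unit.
W_in = rng.normal(scale=0.5, size=(n_state, n_features))
W_state = rng.normal(scale=0.5, size=(n_state, n_state))
bias = np.zeros(n_state)

def rnn_step(state, x):
    """One time step: combine the old state and the new input."""
    return np.tanh(W_state @ state + W_in @ x + bias)

inputs = rng.normal(size=(n_steps, n_features))  # input0 .. input4
state = np.zeros(n_state)                        # initial state
outputs = []
for x in inputs:            # unrolled: the same rnn_step, five times
    state = rnn_step(state, x)
    outputs.append(state)   # in this simple unit, the output is the new state
outputs = np.stack(outputs)
print(outputs.shape)  # (5, 4)
```

BPTT would walk this same loop in reverse, accumulating each step's contribution to the gradients of the shared weights.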


But straightforward BPTT has a problem. Recall that each RNN unit 
is made up of neural networks, so the unit’s output is the output of 
an internal neural network that computes a value and passes that 
value through its activation function. As we’ve mentioned, we’ve been 
assuming linear activation functions so far, so we haven’t been draw- 
ing them, but generally speaking we'll have something more complex 
at the end of an RNN unit. 


The problem comes from those activation functions at the end of the 
neural networks inside each RNN unit, which typically use ReLU or 
tanh activation functions. As we saw in Chapter 17, the tanh has outputs
from -1 to 1. Figure 22.23 shows a tanh curve with a solid line.


[Figure 22.23 plot title: Repeated tanh]


Figure 22.23: The tanh function is shown with a solid line. If we take
the output of the tanh at each point and apply that to itself, we get the 
vertically compressed, dashed curve. If we repeat this 5 times, we get the 
dot-dash curve, and 25 repeats gives us the dotted curve. 


When we use backpropagation to push our results backwards through
the unrolled RNN on the right of Figure 22.22, we go through one tanh
function after another. In effect, we’re applying the tanh repeatedly.
Figure 22.23 shows what happens when we apply the tanh several times
in a row. In short, the output values move closer to 0. But the
real problem is that the region where the values change from negative
to positive gets narrower. This means that a change to any input outside
that zone produces almost no change in the corresponding output.


This is terrible for learning, because it means that the gradients drop to
0. Recall that when the gradients are 0, there’s no learning. When the
gradients are close to 0, the system will improve with glacial slowness.
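
We can check this numerically with a small sketch, assuming NumPy. The starting value of 2.0 and the repeat counts are arbitrary choices for illustration. The slope is estimated with a simple finite difference.

```python
import numpy as np

def repeated_tanh(x, n):
    """Apply tanh to x a total of n times."""
    for _ in range(n):
        x = np.tanh(x)
    return x

def slope(x, n, eps=1e-5):
    """Numerical slope of the n-times-repeated tanh at x."""
    return (repeated_tanh(x + eps, n) - repeated_tanh(x - eps, n)) / (2 * eps)

x = 2.0
repeats = [1, 2, 5, 25]
values = [repeated_tanh(x, n) for n in repeats]
slopes = [slope(x, n) for n in repeats]
for n, v, s in zip(repeats, values, slopes):
    print(f"{n:2d} repeats: value {v:.4f}, slope {s:.6f}")
```

Both the value and the slope shrink as the number of repeats grows, which is the behavior that starves the earliest time steps of gradient.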


This phenomenon is called the vanishing gradient problem 
[Hochreiter01] [Pascanu12]. The word “vanishing” means that the
value “fades away” by moving closer to 0. Though we won’t get into it, 
the problem can go the other way, where the derivatives get bigger and 
bigger without end. This rarer phenomenon is called the exploding 
gradient problem [R2RT16]. 


And there’s another problem we need to address: there is only finite 
memory available to any real RNN for use as state. If we are trying 
to understand a sentence like, “Bob said he was hungry,” then we 
don’t need a lot of memory to connect “he” with “Bob.” But suppose 
that we have a sentence that begins, “When Bob saw his neighbor’s 
two cats outside the garage, watching him, he ignored their focused 
stares and continued with his elaborate workout regimen...” and after 
many words concludes with, “...and when it was all over, they were 
still there, watching him.” We might need a huge amount of memory 
to connect the early “two cats” with the later “they.” Even if we give 
our RNN unit a lot of memory, we can always construct an input that 
would need more. This is called the long-term dependency problem
[Hochreiter01] [Olah15].


The good news is that we can solve both the gradient problem and the
long-term dependency problem by using RNNs with fancy internals. 
The most popular of these RNN units goes by the acronym LSTM, and 
we'll look at it next. 


22.7 LSTM and GRU 


Our RNN unit in Figure 22.18 suffers from vanishing gradients, explod- 
ing gradients, and trouble with long-term dependencies [Bengio94]. To 
work around these problems, researchers have investigated a variety 
of approaches. One involves simple RNNs that are carefully initialized 
[Quoc15]. The more popular approach right now uses a kind of RNN
cell with the seemingly contradictory name of Long Short-Term 
Memory, or LSTM [Hochreiter97]. The LSTM has become so popu- 
lar that when people today speak of an RNN unit generally, they often 
mean an LSTM. 


The name comes from thinking about what’s going on inside an RNN 
cell. We can say that a cell like that in Figure 22.18 has some per- 
sistent, or long-term memory in the form of its state. The state can 
hang onto its values from one input to the next, indefinitely. The cell 
also has some short-term memory in the form of the neuron outputs. 
These numbers are fleeting and exist only during the processing of a 
new input, and then are replaced with new values when a new input 
arrives. 


The goal of the LSTM is to take some of those short-term, fleeting val- 
ues and give them a longer lifespan, allowing them to contribute to 
future calculations. Thus the short-term memory is being given a lon- 
ger life, resulting in the name “long short-term memory,” or LSTM. 
It might be helpful to think of this instead as “persistent short-term 
memory.” 


Covering the internal structure of an LSTM in detail would take us far
afield, and it’s not necessary for understanding how to use it. But it’s 
helpful to have a basic knowledge of what’s going on, so let’s get an 
overview of its operation. 


As we know, each RNN cell contains some memory to hold its state, 
which is usually just a list of numbers. When talking about the inter- 
nals for an LSTM unit, or cell, we sometimes call the state the unit 
memory or cell memory. The cell memory is initialized with default 
initial values before the first input arrives, but as we’ve seen, those val- 
ues change as inputs are received and the cell memory is updated. The 
cell memory can also forget information that is no longer necessary. 
This is all under control of neurons located inside the LSTM. 


The beauty of the whole system is that with enough training, the weights 
in the neurons inside the LSTM unit learn to adjust themselves in such 
a way that they control the memory to remember and forget data in 
just the right ways at the right times. Remember that once the weights 
in these internal networks have been learned by backprop, they don’t 
change. But when the unit is evaluating new data, those networks con- 
trol the cell memory, which does change. 


22.7.1 Gates 


A key idea in the LSTM is a mechanism called a gate. 


We can visualize a gate roughly as a physical gate that controls the flow 
of water out of a pipe, as in Figure 22.24. 


[Figure 22.24 panel labels: gate open 100%, 75%, 50%, 25%, 0%]


Figure 22.24: Picturing a gate over a water pipe. 


When the gate is 100% open, all of the water that enters the pipe can 
exit it. When the gate is 0% open, it’s fully blocking the exit, and no 
water emerges. Intermediate positions of the gate allow through a cor- 
responding amount of the water. This metaphor is not perfect because 
the interaction of water with a gate like this is more complicated than 
we’re pretending.


We can easily implement a gate in a program just by multiplying 
a starting value (the amount of water in the pipe) by the gate value 
(its position). The physical diagrams of Figure 22.24 can then be pro- 
grammed as in Figure 22.25. 





[Figure 22.25 panel labels: gate open 100%, 75%, 50%, 25%, 0%]


Figure 22.25: Implementing the effect of a gate with numbers. In each 
diagram, the input value of 100 is shown at the top, the gate value is shown 
in blue, and the result is in the box at the bottom. We just multiply the 
input value by the gate value to get back the gated value. The percentage 
by which the gate is open is implemented as a number from 0 to 1 by just 
dividing that percentage by 100. 


If we want to adjust a whole bunch of values by different amounts, we
apply a different gate position to each one. In other words, we would 
multiply each input value by some other corresponding value. 


This is nothing more than the first step of an artificial neuron. Our 
neurons then go on to add up the results and apply an activation func- 
tion, but we don’t need all that extra stuff. We just multiply each input 
by its gate position and that’s our output. 


When we use gates in this way, we restrict the gate values to the range
0 to 1. So when we apply a gate to an input value (or a list of gates to a
list of input values), the input values can stay the same or shrink in
magnitude by any amount until they hit 0. They can’t grow larger, because
the gate will never be larger than 1, and they can’t change sign, because
the gate will never be less than 0.
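
A gate really is just a multiplication. The following minimal sketch, assuming NumPy, applies the five gate positions of Figure 22.25 to an input of 100. The helper name `apply_gate` is hypothetical.

```python
import numpy as np

def apply_gate(values, gate):
    """Scale each value by its gate position, which must lie in [0, 1]."""
    gate = np.asarray(gate, dtype=float)
    if np.any(gate < 0) or np.any(gate > 1):
        raise ValueError("gate values must be between 0 and 1")
    return np.asarray(values, dtype=float) * gate

# Five gate positions, each applied to an input of 100.
water = np.full(5, 100.0)
positions = np.array([1.0, 0.75, 0.5, 0.25, 0.0])
print(apply_gate(water, positions).tolist())  # [100.0, 75.0, 50.0, 25.0, 0.0]
```

The range check is there only to enforce the rule that a gate never amplifies a value or flips its sign.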


An LSTM uses gates for three purposes: forgetting, remember- 
ing, and selecting. The remember and select gates are also called the 
input and output gates. Let’s look at these in turn. 


We often think of “forgetting” as meaning that a memory is completely 
lost. In an LSTM, forgetting is usually a partial thing, where we can 
forget a value anywhere along the continuum from not at all to com- 
pletely. To forget a number means that we push it towards 0. When it 
reaches 0, it’s completely forgotten. We can selectively forget a value 
by multiplying it with a gate value, which we know is always between 0
and 1. Figure 22.26 shows a starting memory of 5 elements on the left.
In the middle, we see five gate values. When the gate is 1, we'll forget
nothing about the corresponding memory element, because it won’t
change. When the gate is 0, we'll completely forget the contents of the
corresponding memory element, because it will go to 0. Intermediate
values of the gate will cause us to partly “forget” the value of the
corresponding memory element. The right column shows the memory after the
forgetting operation has completed. 


Figure 22.26: Forgetting values in memory. (a) Each value is multiplied
by a gate, and the result is stored back into the memory. Gate values of 1
don't forget anything about their corresponding memory cell, while gate
values of 0 cause that value to be completely forgotten. Intermediate
values of the gate forget the cell contents by a corresponding amount. (b) 
The operation in schematic form. 


The act of remembering involves two steps. First, we determine how 
much of each new value we want to remember. Of course, we use gates 
to control that. Then to remember the gated values, we merely add 
them in to the existing contents of the memory. 


In Figure 22.27 we see on the left a list of new values that we'd like to 
remember. But we don’t want to remember them all at full strength. 
So just as in Figure 22.26, we apply gates to them, producing a list of 
gated values. To actually remember these, we just add them into the 
existing memory, element by element. 


Figure 22.27: The process of remembering. (a) We start with new values
to remember, shown on the left. We may not want to remember these at 
full strength, so we first apply gates. Then we add the gated values to the 
existing memory, and save that back into the memory. (b) The operation 
in schematic form. 


Finally, to select from memory we just determine how much of each 
element we want to use. As shown in Figure 22.28, we apply gates to 
the memory elements, and the results are a list of scaled memories. 


[Figure 22.28 labels: memory, gates, selected values; panels (a) and (b)]


Figure 22.28: Selection. (a) To select memories, we gate the values in our 
memory. The gated results are our selections. (b) The schematic version. 
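
The three gating operations can be sketched together on a 5-element memory. This is an illustration only, assuming NumPy; the memory contents and gate values are made up and are not the numbers drawn in the figures.

```python
import numpy as np

memory = np.array([4.0, -2.0, 6.0, 1.0, -8.0])   # existing cell memory

# Forget: scale each memory element toward 0.
forget_gates = np.array([1.0, 0.5, 0.0, 1.0, 0.25])
memory = memory * forget_gates                   # [4, -1, 0, 1, -2]

# Remember: gate the new values, then add them into the memory.
new_values = np.array([2.0, 2.0, 10.0, -4.0, 0.0])
remember_gates = np.array([0.0, 1.0, 0.5, 0.25, 1.0])
memory = memory + new_values * remember_gates    # [4, 1, 5, 0, -2]

# Select: gate the memory to produce an output; the memory is unchanged.
select_gates = np.array([1.0, 0.0, 0.5, 1.0, 0.5])
selected = memory * select_gates                 # [4, 0, 2.5, 0, -1]

print(memory.tolist())
print(selected.tolist())
```

Forgetting and remembering both write back to the memory, while selecting only reads from it.
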


22.7.2 LSTM


The LSTM unit uses the gating steps we saw above to manage its inter- 
nal memory. The gating operations give the cell a lot of control over 
what it remembers and forgets over time, so it can manage its internal 
cell memory in the most effective way. 


Let’s look at the architecture of an LSTM. Our discussion is adapted 
from the graphics and presentation by Olah [Olah15]. Figure 22.29 
shows a single LSTM cell. It accepts its previous output and a new 
value as input, and produces a new output. 


[Figure 22.29 labels: previous output, input, output]


Figure 22.29: The architecture of an LSTM unit. The triangles represent 
gates. The circles represent sets of neurons. 


At the top of the diagram is our state memory. We have three gates, 
labeled F for forget, R for remember, and S for select. There are also 
four collections of neurons, which we've labeled A through D. The 
input to these neurons is a single list formed by simply placing the 
previous output and the new input one atop the other. 


Let’s look at the three steps involved in processing an input. 


We'll start with the forgetting step, highlighted in Figure 22.30. The
new input and the previous output are used by the neurons in A to 
create a list of gate values. Those are then applied to the current state. 
This means that each element in the list of values held by the state 
either stays the same (if its gate value is 1), or else moves towards 0, 
causing us to “forget” some or all of that value. 


[Figure 22.30 labels: previous output, input]


Figure 22.30: The forgetting stage of an LSTM. The combined input and 
previous output are transformed by the neurons in A into gate signals, 
which then control the F gate and cause some of the values from the 
state to be forgotten. 


We can look at the individual pieces in the forgetting step by drawing 
the tensors and showing all the neurons. Figure 22.31 shows how this 
would look if our input and output each had 2 values, and we were 
using 3 elements in the state memory. The previous output and the new 
input are stacked on top of one another (as long as we’re consistent, it 
doesn’t matter what order we use). This 4-element tall tensor goes into 
three neurons, one for each element in the state. These neurons do the 
usual job of weighting each of the four inputs, and summing together 
the results. As the figure shows, the last step is to apply a sigmoid
activation function, which squashes each neuron’s output into the range
0 to 1. This makes it appropriate to use as a gate.


Figure 22.31: Expanding the LSTM forgetting stage of Figure 22.30,
showing the shapes of the data that is moving around. The previous
output and current input are combined into a single tensor, which is fed
into three neurons, each with a sigmoid activation function. Their three
outputs are used to control a gate, whose input is the 3-element state. 
The output of the gate is the state values after being gated by the outputs 
of the A neurons. 


These gate values then control the forget gate. The inputs to the gate
are the 3 elements in the current state. Each one is multiplied by its
gate value, causing it to remain the same or move closer to 0.
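
Here is a small sketch of the forgetting stage with the sizes used above: 2 values of previous output, 2 values of input, and 3 elements of state. It assumes NumPy, random weights in place of learned ones, and one bias per neuron (the bias is an assumption, since the text mentions only weights). The names `W_a` and `b_a` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)

prev_output = np.array([0.3, -0.1])       # 2 values
new_input = np.array([1.2, 0.7])          # 2 values
state = np.array([5.0, -3.0, 0.5])        # 3 elements of state memory

# Stack previous output on top of the new input: a 4-element tensor.
stacked = np.concatenate([prev_output, new_input])

# Neurons "A": one neuron per state element, each with 4 weights and a bias.
W_a = rng.normal(size=(3, 4))
b_a = np.zeros(3)
forget_gates = sigmoid(W_a @ stacked + b_a)   # 3 values, each between 0 and 1

# Gate F: multiply each state element by its gate value.
forgotten_state = state * forget_gates
print(forget_gates, forgotten_state)
```

Each element of `forgotten_state` keeps its sign and is no larger in magnitude than the matching element of `state`.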


So at the end of the forgetting stage, we have a temporary copy of the 
state where we’ve forgotten, or pushed towards 0, some of the ele- 
ments in the state. 


The next stage is to remember something about the new input that’s 
just come in (and something about the previous output as well, if we 
want). As shown in Figure 22.32, we send the combined previous output
and input to two sets of neurons, B and C. The C neurons are going
to be used as gate values, so they have a sigmoid activation function on
the end. The outputs of B are the values being gated (and, ultimately,
remembered). These gate values control how much of the combined
previous output and current input, as transformed by B, should be
remembered when it passes through the R gate.


[Figure 22.32 labels: previous output, input]


Figure 22.32: The remember stage of an LSTM. The combined input and 
previous output are fed into two sets of neurons, labeled B and C. The 
outputs of C are used to control the remember gate R, which adjusts the 
values coming out of B. The result is that the input and previous output 
are scaled down, and then added to the version of the state coming out 
of the forget gate F. The result is then written back to the internal state. 


As we saw before, the resulting gated values are added into the state. 
This new value of the state is then written to the state memory, so these 
values are now remembered. 


Finally, we select some of this newly computed state for output. As
shown in Figure 22.33, we run the previous output and current input
into another set of neurons marked D, which also have sigmoid activation
functions. We use the output of the D neurons as gate values in
the gate marked S. The input to this gate is the new state we just
computed, but first we run that through a tanh function. This is the same
S-shaped function that we saw when discussing activation functions
in Chapter 17. Without getting into the details of why this step is there,
its function is to squash the values we just computed to the range -1 to
1. Those values get gated by the S gate, and then presented as the
output of the LSTM cell.


[Figure 22.33 labels: previous output, input, tanh, output]


Figure 22.33: The select stage of an LSTM. The combined input and 
previous output are used to control gate S, which adjusts the new state 
that was generated by the remember stage. This is the cell’s output. 


To recap the LSTM’s operation, it takes the previous output and the 
current input and combines them. This combined signal will be used 
to form gate values, and will also get remembered. 


The first step is to forget some of the current contents of the state, 
by running it through the F gate. The second step is to remember 
some of the new information, by running it through the R gate and 
then adding that result to the state. The result of that becomes the new 
state. The third step is to select some of that new state to present as
output, by running it through the S gate.


In this way the LSTM can remember information indefinitely, by not 
forgetting it and then not adding to it. Or it can completely forget some 
information and replace it with new values. Or it can partly forget 
some information, and then partly remember some new values along 
with them. 


All of this is controlled by the four sets of neurons that we’ve labeled A,
B, C, and D. All of their weights are learned using gradient descent, like 
any other weights. Eventually the LSTM learns values for the weights 
that allow it to forget, remember, and select the right information at 
the right times in order to give us useful results. 
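
The three stages can be put together in one short sketch of a single LSTM step. This is a simplified illustration, not a definitive implementation. It assumes NumPy, random weights in place of learned ones, a tanh on the B neurons (as in the standard formulation, which the text does not specify), and an output the same length as the state. The helper names `make_neurons` and `lstm_step` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_neurons(rng, n_out, n_in):
    """Hypothetical helper: random weights and zero biases for one neuron set."""
    return rng.normal(scale=0.5, size=(n_out, n_in)), np.zeros(n_out)

def lstm_step(state, prev_output, x, params):
    """One LSTM step: forget (A, F), remember (B, C, R), select (D, S)."""
    stacked = np.concatenate([prev_output, x])
    (Wa, ba), (Wb, bb), (Wc, bc), (Wd, bd) = params
    forget = sigmoid(Wa @ stacked + ba)        # A -> gate F
    candidates = np.tanh(Wb @ stacked + bb)    # B (tanh is an assumption)
    remember = sigmoid(Wc @ stacked + bc)      # C -> gate R
    select = sigmoid(Wd @ stacked + bd)        # D -> gate S
    new_state = state * forget + candidates * remember
    output = np.tanh(new_state) * select
    return new_state, output

rng = np.random.default_rng(3)
n_in, n_state = 2, 3
params = [make_neurons(rng, n_state, n_state + n_in) for _ in range(4)]

state = np.zeros(n_state)
output = np.zeros(n_state)
for x in rng.normal(size=(8, n_in)):   # 8 time steps of 2 features each
    state, output = lstm_step(state, output, x, params)
print(state, output)
```

Note that the weights in `params` never change inside the loop. Only the state and the output change as each new input arrives.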


The LSTM largely avoids the problem of vanishing and exploding gradients
because all of its calculations are bundled together internally. The
activation function that the network presents to the world is the linear (or
identity) function, which isn’t changed if it’s applied to itself over and
over [Suresh16]. This avoids the problem that we saw with the repeated
tanh, where the curve flattened out.


The LSTM avoids the problems due to long-term dependency because 
the values in the state are “protected” from being forgotten, when 
appropriate, by the coordinated actions of the forget and remember 
gates. 


The original form of the LSTM [Hochreiter97] has gone through several
refinements over time [Graves14]. For example, the forget gate
wasn’t part of the original design of the LSTM, but was proposed
several years later [Gers00].


One of the more famous variants of the LSTM is called the Gated 
Recurrent Unit, or GRU [Chung15]. The GRU is like an LSTM but 
with some simplifications. For instance, the forget and input gates are 
combined into a single gate [Olah15]. Since there’s a bit less work to 
be done, a GRU can be a bit faster than an LSTM. It also usually pro- 
duces results that are similar to the LSTM [Chung14]. When working 
with RNNs, it’s often worthwhile to try both the LSTM and GRU to see 
if either provides more accurate results for a particular network and 
data set. 


22.8 RNN Structures


RNN units are remarkably versatile, whether they're LSTMs, GRUs, or 
some other variant. We can use them to build many different kinds of 
structures to do different jobs. 


22.8.1 Single or Many Inputs and Outputs 


Figure 22.34 shows a variation of a famous diagram by Karpathy 
[Karpathy15b] that illustrates several of these structures in unrolled 
form, along with names that relate their numbers of inputs and outputs. 
Here the word “many” can be thought of as a synonym for “sequence.” In
these diagrams, we typically leave out inter-unit connections, like the 
one used by the LSTM to provide the output of one step as an input to 
the next. 


Figure 22.34: Five different types of RNN structures in unrolled form.
The names describe whether the input and output have one value, or are 
a sequence of many values. Image after [Karpathy15b]. 


The one to one structure is included because we can build it, but it’s
kind of a waste of an RNN. We give it a single input (that is, a feature 
with a single time step) and it produces a single output. We say it’s a 
waste because with a sequence length of 1, the RNN cell is making little 
use of its unique ability to remember things about its input sequence. 


The one to many structure takes in a single piece of data and produces
a sequence. We do this by giving the RNN the information we
have to get it started, and then we let it run for a while, producing
multiple pieces of output. We might give it the starting note for a song, and
the network produces the rest of the melody for us.


The many to one structure reads in a sequence and gives us back a 
single value. This organization is used frequently in the field of sen- 
timent analysis, where the network is given a piece of text and then 
reports on some quality inherent in the writing. A common example is 
to look at a movie review and determine if it was positive or negative 
[Timmaraju15]. 


The many to many structures are in some ways the most interesting. 
Here we see two examples of this. On the left, we “prime” the RNN 
with several pieces of input before we ask it to start producing outputs. 
On the right, we start producing outputs right away. 


The first instance, where output is delayed, can be used for machine 
translation. In some languages words don’t come in the same order, 
so we can’t start translating right away. For example, the English sen- 
tence “The black dog slept in the hot sun” can be expressed in French 
as “Le chien noir dormait dans le soleil chaud.” In the French version,
the adjective “noir” (black) follows the noun “chien” (dog), so when
translating from French to English we need some kind of buffer so we can
produce the words in their proper English order.


In the second case, each new input produces a corresponding new output.
We could use this to create a description for every frame of a video,
or to disguise someone’s voice by transforming its sonic qualities into
an older or younger version of itself, or even into the voice of someone
else.
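
The difference between many to one and many to many comes down to which outputs we keep. The sketch below assumes NumPy and a simple tanh RNN with random weights, standing in for an LSTM or GRU. The function `run_rnn` and its `return_sequence` flag are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(4)
n_features, n_state = 3, 4
W_in = rng.normal(scale=0.5, size=(n_state, n_features))
W_state = rng.normal(scale=0.5, size=(n_state, n_state))

def run_rnn(sequence, return_sequence):
    """Run a simple tanh RNN over a sequence of feature vectors.

    return_sequence=True gives one output per time step (many to many);
    False gives only the final output (many to one).
    """
    state = np.zeros(n_state)
    outputs = []
    for x in sequence:
        state = np.tanh(W_state @ state + W_in @ x)
        outputs.append(state)
    return np.stack(outputs) if return_sequence else outputs[-1]

sequence = rng.normal(size=(6, n_features))              # 6 time steps
many_to_many = run_rnn(sequence, return_sequence=True)   # shape (6, 4)
many_to_one = run_rnn(sequence, return_sequence=False)   # shape (4,)
print(many_to_many.shape, many_to_one.shape)
```

The many to one result is simply the last row of the many to many result, since the same unit processed the same sequence in both cases.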


Let’s suppose that we have a piece of video of a ball flying through the
air. We'd like to assign a label to every frame, accounting for 4 different
angles of ascent, 4 of descent, and level flight, for a total of 9 labels.


We can use a CNN to identify the ball’s position, but the CNN can’t tell 
whether the ball is rising or falling. That requires knowing what came 
before. In other words, we need some context. Cue the RNN! 


To assign these labels, we could build a classifier that starts with a con- 
volution layer and then goes into a layer that holds an LSTM unit. We 
call this latter layer a recurrent layer, or an RNN layer. The sym- 
bol we use for this layer is shown in Figure 22.35. The symbol for a 
layer shows an arrow on most of a counterclockwise circle, while the 
symbol for a single unit, shown in Figure 22.15 is only an arrow on half 
of a clockwise circle, so there are two cues to help us tell them apart. 




Figure 22.35: Symbols for an RNN layer. (a) A common symbol for RNN 
layers. The loop at the right is meant to remind us of the internal state 
being read and written. The black box represents a step of delay. (b) Our 
icon for an RNN layer. 


Let’s start our network with a convolution layer, which will look for the
ball in each frame of video. Figure 22.36 shows this as our first layer. 
For specificity in this discussion, we'll arbitrarily pick an image of 128 
by 128, and a convolution layer with 8 filters of size 5 by 5 and a ReLU 
activation. We'll shrink the output to 64 by 64 by using a stride of (2,2) 
(we could use a pooling layer here instead of striding). 


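[Diagram labels, in order: a dense layer of 9 neurons; an RNN layer with 1 LSTM cell, mem = 128; a convolution layer with 8 filters of 5 x 5, ReLU, stride = (2,2); an input of 128 x 128.]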


Figure 22.36: A CNN-LSTM for categorizing the flight of a ball from
video.


The output of the convolution layer is a tensor that’s 64 by 64 by 8. 
We'll flatten that and feed it into an LSTM layer. It will have 1 LSTM 
cell, with 128 elements of memory. The output goes into a dense layer 
of 9 neurons with a softmax output, so we get back 9 probabilities, one 
for each class. 


When we combine the convolution and recurrent layers in this way the 
result is often called a CNN-LSTM network. 
This little network is going to require a lot of training data, since it
uses just a little under 17 million weights. The convolution and dense
layers together use only about 1400 weights, so the RNN is almost
solely responsible for the complexity of this network. It’s devoting all
of those weights to controlling the internal neurons that compute
the gating values, and the modified versions of the input and previous
output. Halving the LSTM’s memory from 128 elements to 64 also roughly
halves the number of weights needed, and halving it again to 32
halves the weight count once more, bringing it down to a bit over 4 million.
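

The weight counts above can be checked with a little arithmetic. The sketch below is only a check under stated assumptions: a single-channel (grayscale) input, one bias per filter and per neuron, and the standard LSTM layout of four internal sets of weights that each see the current input and the previous output. The function names are ours, chosen for illustration.

```python
def conv2d_weights(filters, kernel_h, kernel_w, in_channels):
    # Each filter has kernel_h * kernel_w * in_channels weights plus one bias.
    return filters * (kernel_h * kernel_w * in_channels + 1)

def lstm_weights(memory, inputs):
    # Four internal sets of weights (three gates plus the candidate values).
    # Each one sees the current input and the previous output, plus a bias.
    return 4 * (memory * (inputs + memory) + memory)

def dense_weights(neurons, inputs):
    # One weight per input per neuron, plus one bias per neuron.
    return neurons * (inputs + 1)

flattened = 64 * 64 * 8                  # the flattened convolution output
conv = conv2d_weights(8, 5, 5, 1)        # assumes a 1-channel input image
dense = dense_weights(9, 128)
print("convolution + dense:", conv + dense)

for memory in (128, 64, 32):
    print("memory", memory, "LSTM weights:", lstm_weights(memory, flattened))
```

Under these assumptions the convolution and dense layers come to 1,369 weights, and the LSTM needs 16,843,264 weights with 128 elements of memory, 8,405,248 with 64, and 4,198,528 with 32. Almost all of that comes from the 32,768 values in the flattened input, each of which is connected to every internal neuron.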


We can stack up layers of RNNs to make deep recurrent networks, just 
like any other kind of layer. We can also exploit the sequential nature 
of RNNs and run them backwards. And we can combine the two. Let’s 
look at all of these architectures. 


22.8.2 Deep RNN 


We can arrange our LSTM units in layers, so that the output of
each unit serves as the input to other units. This is called a deep RNN,
where the adjective “deep” refers to these multiple layers.


A schematic view of a deep RNN is shown in rolled-up form in Figure 
22.37(a). We show an unrolled version in Figure 22.37(b). 
Figure 22.37: A deep RNN uses multiple RNN stages, with the output of
one feeding the input of the next. Each stage maintains its own state. This 
deep network uses three layers of RNNs. (a) The usual rolled-up diagram. 
(b) The unrolled version. 


The basic idea is that the LSTM on each layer feeds the LSTM on the 
next layer. So the first time step of a feature is fed to the first LSTM, 
which processes that data and produces an output (and a new state for 
itself). That output is fed to the next LSTM, which does the same thing, 
and the next, and so on. Then the second time step arrives at the first 
LSTM, and the process repeats. 


In this setup, all LSTMs before the last one are able to work with 
intermediate representations that make sense to the next layer, but 
wouldn’t be immediately useful to us as outputs of the network. So 
the early LSTMs can encode their data in dense and complex ways for 
maximum efficiency. In practice, we’d need the first layer to return a 
sequence, rather than a single value. We'll revisit this idea in Chapters 
23 and 24. 
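

The flow of data through a deep RNN can be sketched in a few lines. This is a toy illustration, not a real LSTM: each "unit" here is a single tanh neuron that remembers its last output, and the weights are arbitrary values we picked by hand. The class and function names are hypothetical. What the sketch does show is the ordering described above: each time step passes up through every layer before the next time step arrives, and each layer keeps its own state.

```python
import math

class ToyRecurrentUnit:
    """A stand-in for an LSTM: one tanh neuron that remembers its last output."""

    def __init__(self, input_weight, state_weight):
        self.input_weight = input_weight
        self.state_weight = state_weight
        self.state = 0.0          # each layer maintains its own state

    def step(self, value):
        # Combine the new input with the remembered state, then remember the result.
        self.state = math.tanh(self.input_weight * value
                               + self.state_weight * self.state)
        return self.state

def run_deep_rnn(layers, sequence):
    outputs = []
    for value in sequence:        # one time step at a time
        for layer in layers:      # the output of one layer feeds the next
            value = layer.step(value)
        outputs.append(value)     # the output of the last layer for this step
    return outputs

# Three stacked layers with arbitrary, hand-picked weights.
layers = [ToyRecurrentUnit(0.5, 0.9),
          ToyRecurrentUnit(1.0, 0.5),
          ToyRecurrentUnit(1.5, 0.1)]
outputs = run_deep_rnn(layers, [1.0, 0.0, 0.0, 0.0])
print(outputs)
```

Even though only the first input is nonzero, the later outputs are nonzero too, because each layer's state carries information forward from one time step to the next.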
22.8.3 Bidirectional RNN


The LSTM is designed to process a sequence of values, which we usu- 
ally think of as arranged in chronological order. 


As we've seen, if we’re trying to analyze text, this lets us consider the 
sentence, “Charles said he needs a vacation,” and figure out who is 
referred to by the pronoun “he.” 


But experience has shown that sometimes it makes sense to give inputs 
to an LSTM backwards, starting at the end and working backwards 
towards the beginning [Sutskever14]. Why would this ever be helpful? 


Suppose that we’re trying to make sense of the meaning in this sen- 
tence: “Saying, ‘I need a vacation’, Charles sat down.” If we want to 
know who “I” refers to, then scanning the sentence from finish to start 
would let us know it was Charles. 


This idea led to the introduction of the Bidirectional RNN or
BRNN [Schuster97]. It’s sometimes called a Bidirectional LSTM
or BLSTM when we're specifically using LSTM units.


As the name implies, this network runs the input in both directions at 
once. Of course this can only work in situations where we already have 
data all the way to the end of the chunk we're trying to analyze, so it’s 
not applicable to every situation. For example, if we’re trying to under- 
stand commands spoken in real time, to start at the end we’d have to 
wait for the person to finish speaking. 


Figure 22.38 shows the structure of a bidirectional RNN layer. We use 
two RNNs together in one layer, one getting the inputs from start to 
finish, the other getting the inputs from finish to start. This can seem a 
little mind-bending, but it’s straightforward from an architectural per- 
spective. Unfortunately, the standard diagram for a BRNN, as in Figure 
22.38, can make it hard to work out the unusual timing of inputs and 
processing, so we'll walk through an example. 
Figure 22.38: A bidirectional RNN is two RNNs running together. One
of them is given the time steps in the usual order, from first to last. The 
other is given the time steps in the opposite order, from last to first. In 
this diagram, there are only two LSTM units, each with its own state. (a) 
Our symbol for a BRNN. (b) An unrolled BRNN. 


Figure 22.38(a) shows our shorthand for a BRNN. In the unrolled version
of Figure 22.38(b), we can see two LSTM units, one in light green
and the other in darker green. To get started, the value of time step 0
is handed to the “forward” LSTM unit (light green) while simultaneously
time step 4 is handed to the “backward” LSTM unit (dark green).


When both LSTMs have produced their outputs, they arrive at the 
white boxes at the top of the figure. The two values at each white box 
are produced at different times, so the first one just sits until the sec- 
ond one arrives. We'll discuss in a moment how these two values are 
then processed. 


Now we proceed to the next step. We give the value of time step 1 to 
the forward LSTM, and time step 3 to the backward LSTM. Their out- 
puts are held at the white squares, and then we proceed to give time 
step 2 to the forward LSTM, and also to the backward LSTM. Both 
outputs go up to the white square, but before we deal with them let’s 
finish processing the sequence. We give input 3 to the forward LSTM
and input 1 to the backward LSTM, and then finally give input 4 to the 
forward LSTM, and input 0 to the backward LSTM. 


Now we have two values coming into each of the square boxes at the 
top. 


The square boxes combine their inputs in whatever way we choose. 
Typical options are to add them together, average them, multiply them, 
or create a 2-element tensor (that is, a list) by appending one value 
after the other. 


Those values can then go on to act as outputs of the network, or inputs 
to any other layer. 
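

The walk-through above can be sketched in code. As a loud caveat, the recurrent unit here is a toy (a single tanh neuron with hand-picked weights), not an LSTM, and the function names and merge-mode names are ours. The point is only the bookkeeping: the backward unit reads the sequence from last to first, its outputs are put back into time order, and then the two values for each time step are combined.

```python
import math

def run_toy_rnn(sequence, input_weight, state_weight):
    """Run a one-neuron toy recurrent unit (not a real LSTM) over a sequence."""
    state = 0.0
    outputs = []
    for value in sequence:
        state = math.tanh(input_weight * value + state_weight * state)
        outputs.append(state)
    return outputs

def run_bidirectional(sequence, merge="concat"):
    # The forward unit sees time steps 0, 1, 2, ... in the usual order.
    forward = run_toy_rnn(sequence, input_weight=0.5, state_weight=0.9)
    # The backward unit has its own weights and sees the last time step first.
    backward = run_toy_rnn(sequence[::-1], input_weight=0.7, state_weight=0.6)
    # Put the backward outputs back in time order so they line up with forward.
    backward.reverse()
    combine = {
        "sum": lambda f, b: f + b,
        "ave": lambda f, b: (f + b) / 2,
        "mul": lambda f, b: f * b,
        "concat": lambda f, b: [f, b],   # a 2-element list
    }[merge]
    return [combine(f, b) for f, b in zip(forward, backward)]

steps = [0.1, 0.4, 0.2, 0.8, 0.5]        # time steps 0 through 4
pairs = run_bidirectional(steps, merge="concat")
for time_step, pair in enumerate(pairs):
    print(time_step, pair)
```

In the result for time step 0, the forward value has seen only the first input, while the backward value has already seen the whole sequence. At time step 4 the roles are swapped.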


22.8.4 Deep Bidirectional RNN 


If we need even more compute power, we can combine deep RNNs with 
bidirectional RNNs to create a deep bidirectional RNN, shown in 
rolled and unrolled form in Figure 22.39. 
Figure 22.39: A deep bidirectional RNN. Each rounded box represents a
single layer of the bidirectional RNN. We say this is “deep” because there 
are multiple such layers, each feeding the next. It’s bidirectional because 
in each layer there is both a forward and backward LSTM unit. Keep in 
mind that there are only six LSTM units in this diagram, two on each 
layer. (a) The rolled deep bidirectional RNN. (b) The unrolled version. 
Deep bidirectional recurrent neural nets offer a lot of computational
power. That power comes with a lot of weights to be trained, and a 
corresponding need for a lot of training data. For example, the 3-layer 
BRNN in Figure 22.39 configured for 25 cells in each of the six LSTMs 
uses about 37,500 weights. That’s equivalent to a stack of 3 fully-con- 
nected layers with 135 neurons each. 


Deep BRNNs have found use in applications like speech recognition 
[Zeyer17], image captioning [Vinyals15] [Wang16], and creating an
animated talking head [Fan16]. 


22.9 An Example 


Let’s look at an RNN in action. 


For our example, we’re going to use an RNN to generate brand-new 
text. We'll train a little RNN using three collections of Sherlock Holmes 
short stories by Arthur Conan Doyle, all freely available online in text 
form [Gutenberg17]. Taken together, there’s a little over 304,000 
words. Many of these words are used repeatedly, of course. There are a 
bit under 29,000 unique words, including many proper nouns such as 
the names of characters and places. 


A reasonable approach is to think of the text as a collection of words. 
We can then train the RNN on how words follow one another. Then we 
can start with some words, and let the RNN tell us which word should 
come next. Then we'll take our starting bunch, plus the new word at the 
end, and have the RNN tell us which word should follow. Continuing 
the process, we can keep feeding back to the RNN the most recent set 
of words, and it would keep giving us a new word to follow. 


To do this, we can assign a unique number to each of the almost 
29,000 unique words in the text. For example, “the” might be assigned 
91, “Sherlock” might be assigned 307, “Holmes” might be assigned 
53, and so on. We can feed the network one bunch of words at a time, 
where the sample consists of a single feature with as many time steps 
as there are words in that bunch. The RNN would read this sequence
of numbers and try to learn something about which numbers follow
which others. Rather than assign numbers to words haphazardly, we
can try to assign similar numbers to similar words, using an algorithm
called word2vec [Bussieck17] [Mikolov13]. This form of numbering
has many attractive qualities, though in this simple case it might not
make much of a difference.
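

As a minimal sketch of the simple numbering scheme (not word2vec), we can hand out integers in order of first appearance. The function name and the tiny sample sentence are ours, and the particular numbers that result are as arbitrary as the 91, 307, and 53 used as examples above.

```python
def build_vocabulary(words):
    """Assign each distinct word an integer, in order of first appearance."""
    vocabulary = {}
    for word in words:
        if word not in vocabulary:
            vocabulary[word] = len(vocabulary)
    return vocabulary

text = "the man who the man saw"      # a made-up stand-in for the stories
words = text.split()
vocabulary = build_vocabulary(words)

# Replace every word with its number to get the sequence the RNN would read.
encoded = [vocabulary[word] for word in words]
print(vocabulary)
print(encoded)
```

Repeated words get the same number each time they appear, so the encoded sequence is as long as the text, while the vocabulary holds only the unique words.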


With enough training, we can imagine giving our trained network a 
starting bunch of words extracted from the original text, though repre- 
sented as a string of numbers, and then let the RNN run and generate 
an endless stream of new words from that. 


This is an entirely reasonable approach and it can produce results that 
are recognizably like their source text [Deutsch16a] [Deutsch16b].


But this approach takes a lot of time to train, because it needs to figure 
out which of many thousands of words should follow from any previ- 
ous sequence of words. There’s just a huge number of choices, and so 
there are a lot of decisions to be learned. Getting good results is going 
to take a lot of time and computing. 


A faster alternative is to work merely character by character. That is, 
we treat the input as nothing but a string of characters. In the Holmes 
stories, with newline characters removed, there are only 89 unique 
characters remaining (26 lower-case letters, 26 upper-case letters, the 
10 digits, 22 punctuation marks, four accented vowels, and the space). 
Now we have a much smaller problem. Instead of predicting one of 
tens of thousands of possible words, we only need to predict one of 89 
characters. 


Let’s take this simpler and faster approach. 


We could make things even easier by converting all upper-case letters 
to lower case, removing the accented vowels, and reducing the wealth 
of punctuation to just a dozen or so of the most common symbols. 
Including the space, that would be just 49 characters. But there’s use- 
ful information in all of those symbols, so let’s leave them in. 
The big idea is to build an RNN classifier. To train it, we'll provide a
sequence of characters from the original text, and ask it to provide 
probabilities that each of the 89 characters is the next one. We'll 
compare the most likely prediction with the real next character from 
the text, and if they don’t match, we'll assign the result some error. 
Backprop will then train the weights to reduce the error, so that grad- 
ually the system’s prediction of the next character will match the label. 


This is the architecture we'll use to generate new text based on the 
Sherlock Holmes stories. Though it may seem almost impossible that 
anything comprehensible could come out of such a simplistic, charac- 
ter-by-character approach, it can produce surprisingly cogent output 
that matches a wide variety of input styles, from prose to technical 
documents [Karpathy15b]. 


Let’s look more closely at training. Each input consists of a string of 
characters, and a single new character that is our target. Suppose 
we give it the input, “My friend was an enthusiastic musician, being 
himself not only a very capable per”. The last word (from the original 
text) is “performer,” so our goal is to get the network to analyze this 
sequence and assign the highest likelihood to the letter “f.” 


Note that this isn’t a foregone conclusion. There are lots of words in 
the text that begin with “per” but continue with different letters, such 
as “personally,” “perched,” and “perhaps.” So the system has to take 
the whole sequence into account to correctly predict the next charac- 
ter. It’s the RNN’s ability to use the values that have come before, in 
sequence, to guide its decisions that makes it the perfect tool for this 
job. 


Our approach presents us with a tradeoff. The larger the input we give 
to the system (that is, the more characters to analyze on each input), 
the more information it gets and the better it can learn and predict. But 
the smaller the input, the faster the system can run and thus we can 
run through more training samples in a given amount of time. There’s 
no best answer here, so it’s another value we have to experiment with. 
After some trial and error, we settled on the network of Figure 22.40.
This is almost surely not the ideal choice of parameters, but merely 
one that was simple, and worked well enough for our purposes here. 





Figure 22.40: Our entire deep network for generating new Sherlock 
Holmes data. Our input is a list of 40 sequential characters. The charac- 
ters go into two RNN layers. Each contains a single LSTM cell with 128 
elements of memory. The output of the second LSTM is given to a dense 
layer of 89 neurons, which predicts the probability of each character. The 
most probable character is the network’s result. The small box at the top 
of the first layer’s icon tells us that it returns an output for every input, 
and not just the final result. We discuss this practical detail in Chapters
23 and 24.


Our input consists of a string of 40 characters. To create the training 
set, we chopped up the original source material into about a half-mil- 
lion overlapping strings of 40 characters, starting every third character. 
Figure 22.41 shows the idea. 
[Diagram: one line of source text above a stack of overlapping 40-character samples, such as “Sherlock Holmes she is always THE woman.”, each starting three characters later than the one before.]


Figure 22.41: Creating our training data. The source text (top line) is 
chopped up into 40-character pieces, starting at every third character. 
Each row below the first is a single sample of training data, presented to 
the RNN as 1 feature with 40 time steps. 
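

The chopping process can be sketched as follows. This is an illustration under our own assumptions: the function name is hypothetical, and the short string stands in for the full text of the stories. Each sample is paired with the single character that follows it, which serves as the target during training.

```python
def make_training_samples(text, length=40, step=3):
    """Return (samples, targets). Each sample is `length` characters long, and
    its target is the single character that follows it in the text."""
    samples = []
    targets = []
    # Stop early enough that every sample still has a character after it.
    for start in range(0, len(text) - length, step):
        samples.append(text[start:start + length])
        targets.append(text[start + length])
    return samples, targets

# A short stand-in for the source material.
text = "To Sherlock Holmes she is always THE woman. I have seldom heard"
samples, targets = make_training_samples(text)

for sample, target in zip(samples, targets):
    print(repr(sample), "->", repr(target))
```

Because each sample starts only three characters after the previous one, neighboring samples share 37 of their 40 characters. That overlap is why a source of this size yields about a half-million samples.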


To train the network we eventually settled on an RMSprop optimizer 
with a learning rate of 0.01, and a mini-batch size of 100. 


On a 2014 iMac without GPU support, the network of Figure 22.40 
using the hyperparameters we just described took about 30 minutes 
per epoch. Using an Amazon Web Services “p2.xlarge” virtual machine 
with GPU support, that dropped to about 150 seconds (2.5 minutes) 
per epoch. 


Now that we know how to train, let’s see how to generate new text. 


To create new text, we produce a “seed” by picking a random starting 
point in the text, and then extract the next 40 sequential characters 
from there. We give the seed to the network and it produces a new 
character. That new character goes on to the end of the seed, and the 
first character is dropped, giving us a new 40-character seed to use as 
input to produce the next character. We can repeat this as long as we 
desire, creating new output [Chen17a]. Figure 22.42 shows the process. 
Figure 22.42: Generating characters with our RNN. In this example, our
RNN takes inputs of 4 time steps. Top: The seed is taken from the original
text. Here it’s a string of the 4 characters “prac”. We provide these to the
RNN, which predicts the next character will be a “t”. In the next row, we
append the “t” to the end of the seed, and we drop the first letter, giving
us a new 4-character string “ract”. We provide this to the RNN, which
predicts an “i”. Again, we append the “i” to the seed and drop the first
character, now giving us “acti”. We give that to the RNN, which predicts a
“c”, and the process repeats, generating as much text as we desire.
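

The sliding-seed loop is easy to sketch. In this illustration the trained network is replaced by a hypothetical stand-in, lookup_predictor, which simply looks up the window in a short reference string and returns the character that follows it. A real network would instead return the most probable of its 89 characters. All of the names here are ours.

```python
def generate(predict_next, seed, count):
    """Grow text one character at a time from a fixed-length sliding seed."""
    window = seed
    output = seed
    for _ in range(count):
        next_char = predict_next(window)
        output += next_char
        # Append the new character and drop the oldest one, so the
        # window always stays the same length as the original seed.
        window = window[1:] + next_char
    return output

# A toy stand-in for the trained RNN. It only works for windows that
# appear in the reference string with at least one character after them.
reference = "practice makes perfect"

def lookup_predictor(window):
    position = reference.find(window)
    return reference[position + len(window)]

result = generate(lookup_predictor, "prac", 4)
print(result)   # practice
```

The first three predictions are “t”, “i”, and “c”, matching the steps in Figure 22.42. The only thing that changes for the real system is the predictor and the window length of 40.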


To watch the progress of the network, after each epoch of training we
printed the loss, and also generated some text using the network as of
that moment. After the first epoch, using the random seed er price.’
“If he waits a little longer , here is the start of the output, including the
seed:


er price.’ “If he waits a little longer wew fet ius ofuthe henss loll- 
inod fo snof thasle, anwt wh alm mo gparg lests and and metd 
tingen, at uf tor alkibto-Panurs the titningly ad saind soot on 
ourne’ Fy til, Min, bals’ thid the taes tuswe, yeouln is any Geotsant 
thive bast cxiss tilp the seud Bige tour and Crestte memofhl auch 
thoos ow thaa that yawt eranteat tisl wist yho halll hiced, h 
In a sense, that’s remarkably good. The “words” are about English-
sized, and though they’re not real words, they could be. That is, they’re 
not strings of random characters such as we might find in a password, 
like “mx,kG73}Kgl;?2”. To a surprising extent, these are close to being 
words. And this was after just one epoch. 


After 50 epochs, things improved a lot. Here the input seed is nt blood 
to the face, and no man could h. The output, with the seed, was: 


nt blood to the face, and no man could hardly question off his
pockets of trainer, that name to say, yisligman, and to say I am
two out of them, with a second. “I conturred these cause they not
you means to know hurried at your little platter’ ““Why shoubing,
you shout it of them,’ Treating, I found this step-was another
write so put.’ “Excellent!” Holmes to be so lad, reached.


Wow. Things are much better. Remember that the system has no 
knowledge of words at all. It only knows the probabilities of letters fol- 
lowing sequences of other letters. Yet we have mostly real words here, 
with obvious exceptions. Even some of the non-words (like “conturred” 
and “shoubing”) seem plausible. The punctuation is getting there, even 
including the comma at the end of the second quotation. For such a 
simple network and training regimen, this is remarkable. 


By letting this run, we can generate as much of this text as we like. It 
doesn’t get much more coherent, but it doesn’t get any more incoher- 
ent, either. 


Let’s zip ahead to epoch 100. Here’s the output generated from the 
seed, I was right and to add the very few deta: 


I was right and to add the very few details rum, and caused my
vicyally to continued, at Chilstall, and my eye, Midlissapped in 
this girder on his important—and might be returned to turn 
smile him. “He had even out of the diven,” said Barker than 
bothidgar Missinisticular. IXteed much walk for fremed out of 
astictivening away through the lady, and photoh, when he rather 
throw all account. 
The punctuation is much improved, but there are lots of non-words.
This little excerpt doesn’t look like much of an improvement. Should 
we have expected it to be better? Figure 22.43 shows the loss of our 
network over these 100 epochs of training. 


[Plot: “Holmes training history”. Loss on the vertical axis (ticks from 1.0 to 2.5) against epoch on the horizontal axis (0 to 100).]


Figure 22.43: The loss by epoch of our network of Figure 22.40, using the 
parameters in the text. 


The big win clearly came at the plunge at the start. It appears that 
things improved between epoch 50 and epoch 100, though not by a 
lot. But the declining slope at the right side of the graph suggests that 
if we were willing to keep training, the error would probably continue 
to drop at least for a while. Lower error should result in more readable 
output. 


Larger LSTMs (that is, those with more state in each unit) might give 
us better performance. Some nice results were obtained using a net- 
work like ours, but the two LSTM layers had 500 cells each [Tran16]. 
Our network had just a bit over 250,000 parameters, while this larger 
one had about 3.25 million parameters, and it was trained for 1000 
epochs. This kind of training requires powerful GPU support, a whole 
lot of patience, or both. 
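

We can check these parameter counts with the standard formula for LSTM weights. One assumption deserves to be stated loudly: the totals quoted above work out if each character is presented to the first LSTM as a list of 89 values (a one-hot encoding), and that encoding is our guess rather than something stated in the text. The function names are ours as well.

```python
def lstm_weights(memory, inputs):
    # Four internal sets of weights, each seeing the current input and the
    # previous output, with one bias per element of memory.
    return 4 * (memory * (inputs + memory) + memory)

def dense_weights(neurons, inputs):
    # One weight per input per neuron, plus one bias per neuron.
    return neurons * (inputs + 1)

def holmes_network_weights(memory, characters=89):
    first = lstm_weights(memory, characters)    # assumes one-hot characters
    second = lstm_weights(memory, memory)       # fed by the first LSTM
    output = dense_weights(characters, memory)  # one neuron per character
    return first + second + output

print(holmes_network_weights(128))   # 254681
print(holmes_network_weights(500))   # 3226589
```

Under that assumption, the network with 128 elements of memory has 254,681 parameters, and the version with 500 has 3,226,589, in line with the figures of a bit over 250,000 and about 3.25 million.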
Keras
Part 1


The Keras library is a free, open-source Python library that makes it easy
to build and train deep learning models. We'll look at the basic ideas, and
then put them into practice with real data.
23.1 Why This Chapter Is Here


In previous chapters we discussed the fundamentals behind machine 
learning and deep learning, and we’ve seen how to use several types of 
popular layers of neurons. 


In this chapter, we'll put all of this into practice. We'll build deep learn- 
ing systems and teach them. 


There are many fine deep-learning libraries out there, and each has its 
advantages. Rather than try to cover many libraries, we'll focus on just 
one, called Keras. This library is powerful, easy to use, popular, free, 
and open-source [Chollet17a]. 


Another advantage is that Keras lets us write our algorithms once, 
and then run them on any of several other popular and advanced 
deep-learning libraries. This means we’re insulated from the details of 
those libraries while still enjoying the advantages of using their high- 
ly-developed and efficient code. 


One of the nice things about working with Keras is that a typical ses- 
sion of building and training a machine-learning system requires very 
little routine Python programming. The actual deep learning code is 
often the easiest part of the program: we build the network with just a 
few lines, and train it with just one or two function calls. Most of the 
rest of the program is made of supporting tasks, such as getting the 
input data, cleaning it, structuring it for use in the network, writing 
routines for saving data and visualizing results, and so on. 


In this chapter we'll start with simple networks. In Chapter 24, we'll 
expand on these ideas to build more complex models, such as deep 
convolutional networks and recurrent networks. 


To keep things focused we'll stick to only the Keras routines and arguments
we'll need to do our work. Once you’re comfortable with the
basics of Keras, you'll be able to explore other options on your own.
When we work with real code, we always have to deal with administrative
issues like processing our data, manipulating it in various ways,
setting up helper routines, and so on. Controlling this detail, and keeping
ourselves from constantly being sidetracked, is why we’ve limited
our programming discussions to just three chapters in this book. In
Chapter 15 we looked at the machine learning library scikit-learn. In
this chapter and the next we dig into the pleasurable but detail-oriented
work of programming deep networks.


23.1.1 The Structure of This Chapter 


This chapter is not a straight line. 


Our goal in this chapter is to give you the tools to design, build, train, and
use a variety of deep learning networks. We will never lose sight of that 
purpose. But to get there, we will have to periodically stop and cover 
essential groundwork. That will often happen at the start of sections. It 
may sometimes feel like we take two steps forward and one step back. 
But that’s just because we need to pause to lock in a new idea. The pay- 
off will be all the sweeter because we will then see how that idea helps 
us build a working system. 


23.1.2 Notebooks 


Each of the sections in this chapter that has more than a few lines of 
code has an associated Python notebook. The name of the notebook is 
in a callout just after the start of each section. 


23.1.3 Python Warnings 


The world of Python libraries is constantly changing, though often 
in small ways. It’s not unusual for even low-level libraries to change 
from time to time in response to bug fixes and other improvements. 
Unfortunately, this can provoke a warning message when our 
program calls a routine from that library using the old arguments or
defaults. The purpose of the warning is to tell us that we should update 
our code to use the new version of the routine in question. 


This can be confusing or even frustrating when the warning is from 
a routine we're not calling directly. In other words, our code calls a 
library function, which in turn calls something in another library, and 
so on, many levels deep, until something calls an updated library using 
an old convention, and we get a warning. 


Happily, the situation is rarely dire. Updated libraries usually offer a 
generous amount of time during which they support both the old and 
new approaches, so despite the warning, everything will run just fine. 
It is just a warning, after all, and not an error. The volunteers who write 
and maintain most popular Python libraries are diligent about updat- 
ing their code to stay current. That means that at some later time, after 
we've done a routine update of our Python installation, all the libraries 
will be back in sync and the related warnings will stop appearing. 


The bottom line is that errors and crashes require our attention. 
Warnings from libraries that were called by other libraries can usually 
be ignored. 
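

To see that a warning really is just a message and not an error, here is a small sketch using Python's built-in warnings module. The routine old_style_call is hypothetical, standing in for a library function that has been called using an outdated convention.

```python
import warnings

def old_style_call():
    """A hypothetical library routine that still accepts an outdated argument."""
    warnings.warn("this argument is deprecated", DeprecationWarning)
    return 42   # the routine still does its job and returns a result

# Record any warnings instead of printing them, so we can inspect them.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")   # make sure nothing is filtered out
    result = old_style_call()

print(result)
print(len(caught), caught[0].category.__name__, caught[0].message)

# If the message is just noise, we can silence that category for a while.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", DeprecationWarning)
    quiet_result = old_style_call()
```

In both cases the routine runs to completion and returns its value. Silencing warnings is best done narrowly, as here, so that warnings about our own code still get through.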


23.2 Libraries and Debugging 


Before we dig in, it’s fair to wonder why we're using a library at all.
Surely it would be more educational to write all of our own code from 
scratch, implementing all the algorithms in this book on our own. 
That process would force us to learn essential details that we could 
otherwise overlook. That argument has a lot of merit. For in-depth 
understanding, writing our own implementations (even if they’re just 
for toy networks) can’t be beat. 


But when it comes to actually building, training, and running deep net- 
works, libraries are almost always the way to go whenever possible. To 
rival what today’s libraries offer immediately, and for free, we’d have 
to spend enormous time and effort on issues like numerical stability,
optimization, GPU programming, multi-threading, and much more. 
Though many of these are not essential to getting a toy system to func- 
tion, when we start processing large amounts of data they become 
necessary to get good results in practical amounts of time. 


As an analogy, consider that most people today use a high-level
language. In this chapter we'll be using Python. But Python is not
executed directly by the computer. Any high-level language is ultimately
turned into low-level instructions for the CPU, which we can write out as
assembly code. Maybe we should be programming in assembly. But why draw
the line there? Assembly code is just a readable way of writing the
processor-specific instructions, called machine code, that control the
low-level hardware of the processor. Maybe we should program
in machine code.


Of course, we don’t use machine code because it would take us forever
to write anything substantial. The value of working at higher and
higher levels of abstraction is that we’re freed up to think in more
abstract terms, and we can spend our time working on how to structure
a solution to our problem rather than on the mechanics of controlling
the computer. For the same reason, using a library like Keras lets us think
abstractly in terms of deep learning ideas, without getting bogged
down in the mechanics of their implementations.


Researchers are frequently publishing new and cleverer ways to teach 
deep networks with greater speed and efficiency. The best way to pre- 
pare for keeping up to date is to master the basics. A strong foundation 
lets us more easily understand and implement these complex tech- 
niques, since they often combine familiar ideas with a few new twists. 
23.2.1 Versions and Programming Style


Like the scikit-learn library we saw in Chapter 15, Keras is Python 
based. 


A note on versions. In 2008, the Python language made a jump from
version 2 to version 3, which is now the standard. We'll be using
Python 3.5 in this chapter, but any release version of Python 3 will be
compatible with our code. Happily, most of this code will run fine on
Python 2.7 installations. The most common difference is merely that
in Python 3, when we use print, we place the argument in parentheses,
e.g., print('Hello'), while in 2.7 we don't use parentheses for
printing.


Just as Python receives updates, so too does the Keras library. In 
2017, the Keras library went from version 1 to version 2. Most things 
remained the same, but there were changes. In this chapter we use 
Keras version 2.0.6. Keras 2 is compatible with both Python 3 and 
Python 2.7. 


Python is a powerful language that has a lot of clever tricks up its sleeve. 
There are ways to write code that is compact and efficient, and that 
code can build on the more than 60,000 libraries that can be installed 
for the language [Ramalho16]. But this is not a book about Python, or 
how to write the shortest or fastest code. 


For these demonstrations, we have preferred clarity and simplicity 
over compactness and even elegance. Our goal is to write code that 
can be understood, so we'll use variable and function names that are 
longer than one might use in practice, and we'll write out some expres- 
sions that could be combined into a single step. We'll even sometimes 
use parentheses that are not strictly necessary, if they make it easier to 
visually grasp what a line of code is accomplishing. 


Our input lines will be shaded light gray, and outputs will be light blue. 
We'll occasionally add a line break and spaces to an output line to make 
it fit the page. 
As we build up our programs, we'll typically present small pieces of
code one at a time. The idea is that the full program will be built by
combining these pieces, usually just by entering them one after the
next. Presenting the code in small pieces makes it easier to
read and discuss.


Many programs need Python import statements to bring in libraries, 
such as NumPy or Keras itself. Our convention will be to include the 
import statement the first time we present a listing that uses a 
function that needs it. To avoid big blocks of boring import 
statements, we won't repeat them in subsequent examples. Happily, 
there's little penalty for importing modules we don't need (just some 
extra time when the program starts), and importing the same module 
more than once is harmless. When developing a piece of code, we could 
simply copy and paste a chunk of text that imports every library that 
we commonly use. When we're done developing the code and we're 
cleaning it up, we can prune away any unnecessary or redundant import 
statements. 
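
As a quick check of the claim about repeated imports, the following sketch uses only the standard library. Python keeps every loaded module in the dictionary sys.modules, so a second import of the same module just hands back the object that is already there. The variable names are ours, chosen for this illustration.

```python
import sys
import math
import math  # a repeated import is harmless: Python reuses the loaded module

first_reference = sys.modules['math']

import math as math_again  # still no second load, just another name

print(math_again is first_reference)   # both names refer to one module object
print(math.sqrt(16.0))
```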


23.2.2 Python Programming and Debugging 


Though this chapter presents a lot of code, we need to remember that 
this is like an art book showing final paintings, or an architecture book 
showing constructed buildings. Almost nothing starts out clean and 
nice. The code examples in this chapter were developed, one line at a 
time, debugged, improved, changed, debugged again, and so on. 


Although the final results may appear simple and straightforward, 
they usually took a twisty and often error-producing path to get to that 
point. The code you see in this book was messy and ugly when I was 
developing it, and then once it was working I cut away the stuff that 
wasn’t needed and cleaned up what was left. We should always expect 
to have to go through a similar process of incremental development 
with all programming, particularly when learning a new library such 
as Keras. 


This process is much easier in Python than in many other languages 
because Python can be programmed interactively. That is, we don’t 
have to write our program in a text editor, save it, compile it, then run 
it. We can do this if we want. But we can also choose to type our code 
one line at a time into an interpreter, getting immediate results. This 
greatly encourages and rewards experimentation. 


The Jupyter system provides a very nice browser-based interactive 
system that is ideal for this kind of experimentation [Jupyter16]. One 
great thing about running Python in a browser is that we can have mul- 
tiple, independent tabs open at once. We can use one tab as our main 
development environment, another for experiments, another for test 
runs, and so on. And there are lots of useful shortcuts that save time 
[Devlin16]. 


A great way to use Jupyter is to grow code one line or statement at a 
time. We can try lots of little experiments, checking everything along 
the way until we’re convinced that we have all the details right. Then 
we can even wrap up that code with a function definition. 


Debugging can be a challenge when using Keras, because the errors 
are often inscrutable. Keras assumes for the most part that we know 
what we're doing, and it doesn’t do a ton of error checking on our code. 
When things do go wrong, we often learn about it because some low- 
level routine that we’ve never heard of finds that it can’t do its job. 
Understanding what went wrong in that routine is usually far from 
obvious. Having all the source code of Keras available can help, but 
debugging our code by reading through the library source requires a 
serious commitment of time and study. 


An easier approach is to find the call we’re making that triggers the 
problem, and then temporarily simplify it as much as possible until 
the problem goes away. If that fails, we can replace the call with a snip- 
pet of code from one of our other projects, or even an online example. 
Then we can transform the working code into our own code one step at 
a time, so we can discover just which step causes it to fail. 


Some of this debugging can be done with little experiments in Jupyter. 
But other times we want a full-featured modern debugger that lets us 
set breakpoints, single-step through the code, examine variables, and 
look at the call stack. We can find those tools in the free PyCharm 
Community Edition IDE [JetBrains17]. 


Copying code back and forth between the two environments can be a 
bit of a hassle, but it’s worth it to take advantage of Jupyter’s immedi- 
ate evaluation and feedback, and PyCharm’s robust debugging tools. 


In addition to Jupyter and PyCharm, there are many other Python 
development tools and environments to choose from. We used Jupyter 
and PyCharm for this book, but it’s well worth the time to explore the 
alternatives out there and find the tools that best suit your style. 


23.3 Overview 


Keras is a library for creating, training, and using deep-learning net- 
works [Chollet17b]. It’s written in Python, so it’s compatible with the 
scikit-learn library we saw in Chapter 15. In fact, it is deliberately 
intended to work alongside scikit-learn, and we'll be using both librar- 
ies freely in this chapter. 


Keras makes it easy to create a deep-learning network by simply build- 
ing up a stack of parameterized layers. This freedom of assembly is 
both a blessing and a curse. 


We can make an analogy to most written languages. In English, we 
can build up a sentence by placing together words left to right in a 
sequence. As long as we follow the rules of sentence construction, 
we can choose our words blindly, and they will always form a valid 
English sentence. For instance, “Shoes and grapes sang clumsy win- 
dows” is a valid English sentence, but it’s meaningless. Perhaps the 
most famous meaningless sentence is “Colorless green ideas sleep 
furiously” [Chomsky57]. In fact, if we just cobble together structurally 
sound sentences, the vast majority will be meaningless. Meaningful 
sentences are rare. 


In the same way, we can assemble all kinds of deep-learning networks 
easily with Keras. But if we want a network that makes sense, meaning 
that it can learn from examples and make good predictions, we need 
to choose our layers, and their parameters, with care. Each layer has 
to make sense in the local context of the layers immediately preceding 
and following it, as well as the larger context of all the other layers in 
the network. 


Much of the discussion in this chapter is meant to provide enough 
understanding of what’s going on so that we can avoid the frustration 
of making the equivalent of “Pencils stumbling over burps never cook 
cooks.” The more we know about what Keras is doing, the better we'll 
be able to avoid building such oddities in the first place, and the 
better-equipped we'll be to fix them when we inevitably make them 
anyway. 


So in this chapter and the next, we’re going to carefully explain each 
step. The goal is that by the end of these chapters, you'll understand 
all the design decisions and choices, so you can design and implement 
new deep-learning networks with confidence. 


23.3.1 What’s a Model? 


The word model deserves some special attention, because it’s used by 
different authors and programmers to mean different things. 


The Keras documentation in particular uses model in three ways. 
First, it refers to the architecture of a deep learning system. Second, it 
describes the combination of that architecture and the weights that it 
learns as a result of training. Third, it can refer to the set of library calls 
that we use to construct our system, also called an API (Application 
Program Interface). For brevity, and to match the Keras documenta- 
tion, we'll use the word “model” in the same three ways. We will try to 
make the meaning clear from context. 


23.3.2 Tensors and Arrays 


We'll be working with data structures that have different numbers of 
dimensions, and we often give them distinctive names. For instance, 
we usually call a 1-dimensional list just a list, a 2-dimensional arrange- 
ment a grid, and a 3-dimensional arrangement a block or volume. 
In machine learning, grids and blocks must be complete. That is, there 
can be no pieces sticking out, and no holes. Each side is flat and every 
cell is filled in. 


All of these arrangements belong to the category of tensors. In fact, a 
tensor can have any number of dimensions. 


To mathematicians and physicists, the word “tensor” refers to a much 
more general idea. The machine learning version of a tensor isn’t tech- 
nically incompatible with the mathematical definition, but they are 
different. This is rarely a problem, but it’s something to keep an eye 
open for when reading papers on machine learning that have a lot of 
physics in them, or vice-versa. 


NumPy also works with tensors, but the NumPy documentation usu- 
ally calls them arrays. Although to many programmers an “array” is a 
1D list, remember that in NumPy, the word refers to a tensor that may 
have many dimensions. 
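
A short sketch shows how NumPy reports the number of dimensions of an array. The variable names are ours. The attributes ndim and shape are part of every NumPy array.

```python
import numpy as np

a_list = np.zeros(5)            # 1 dimension: a list
a_grid = np.zeros((3, 4))       # 2 dimensions: a grid
a_block = np.zeros((2, 3, 4))   # 3 dimensions: a block or volume

for tensor in (a_list, a_grid, a_block):
    # ndim is the number of dimensions, shape is the size along each one
    print(type(tensor).__name__, tensor.ndim, tensor.shape)
```

All three objects have the same type, ndarray, no matter how many dimensions they have. This is why the NumPy documentation can call them all arrays.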


23.3.3 Setting Up Keras 


To install the latest version of Keras, head to https://keras.io/. From 
the “Home” section on the left choose “Installation.” The instructions 
there tend to be short and directed to people who know how to install 
Python systems. If you’re not familiar with how to install libraries on 
your system, there are lots of websites that offer step-by-step instruc- 
tions. There are several popular package managers for Python that 
make it easier to install and manage libraries. 


Before we start using Keras we have to make some important choices 
about the way we want it set up. In many libraries we can use the 
defaults and then learn how to adjust them later for specialized tasks. 
But there are some settings in Keras that we need to choose before we 
get going, because they will determine how we shape our data. Knowing 
about the underpinnings of Keras can be important even when writing 
our first program, so let’s dig in. 


The Keras library is built on many other Python libraries we’ve already 
encountered, like NumPy and SciPy. It also makes use of other libraries 
developed for building deep learning networks. In fact, Keras can be 
seen as just a much easier way to use those deep-learning libraries. 


As of Keras 2.0.8, we can choose to execute our networks using any of 
Theano [Theano16], TensorFlow [TensorFlow16], or CNTK [CNTK17]. Keras 
calls these backends, since they are “behind” the unified Keras 
interface, and provide the engines that actually create and run our 
networks. 


These deep learning libraries have been developed by different groups 
using different principles. Keras hides their differences from us, pro- 
viding a unified and relatively simple way to build and run our networks. 


But the various libraries are different, and we have to choose one when 
we actually train our networks. Comparing and choosing between these 
three options is a moving target. TensorFlow and CNTK are under 
intense development, frequently receive new features and abilities, 
and enjoy ever-improving stability, accuracy, and efficiency. Theano’s 
development was halted in late 2017 with the release of version 1.0 
[Bengio17]. 


Because all the libraries implement the algorithms we saw in earlier 
chapters, we should get similar results when we run the same net- 
work on each backend. The specific results might vary, due to different 
implementations or ways of carrying out calculations. The differences 
that will be important in this chapter are speed and memory use. For 
most small projects the differences in these measurements from one 
backend to another will be small, but when projects get large, we might 
find that one backend offers a different balance of speed and memory 
than another. 


Specifying which library, or backend, we want to use is easy, but it 
involves editing a small configuration text file that’s part of the Keras 
installation. The best place to find up-to-date information on every- 
thing about Keras, including where this configuration file is located, 
is the Keras website https://keras.io. Backend selection guides can be 
found under the “Home” tab on the far left. 


23.3.4 Shapes of Tensors Holding Images 


An issue that can’t quite be swept under the rug is how our data is 
organized. Particularly when we work with images, there are two pop- 
ular but different ways to structure the tensors that hold our data. 


Keras lets us use either approach, as long as we tell it which one we’ve 
chosen. We can do this by naming our choice in the configuration file. 
Let’s look at this choice, and how we identify it. 


Consider a single, RGB color image. The image has a width and height. 
There are also three channels, or slices, one each for red, green, and 
blue. As Figure 23.1 shows, we might imagine the images stacked from 
front to back, or left to right. 


Figure 23.1: Two ways to stack images of size 100 wide by 200 high. We 
read the sizes of the blocks by the number of layers going away, then 
down, then right. (a) Stacking images from front to back. This block has 
dimensions 3 by 200 by 100. This is the channels_first organization. 
(b) Stacking images from left to right. This block has dimensions 200 by 
100 by 3. This is the channels_last organization. 


Suppose our image is 100 pixels wide and 200 pixels high, so in the 
order (rows, columns) we’d write this as (200, 100). We'll specify the 
dimensions of our 3D data structures in the order away, then down, 
then across. With this convention, Figure 23.1(a) places the number 
of channels first, creating a block with shape (3, 200, 100), and Figure 
23.1(b) places the number of channels last, creating a block with shape 
(200, 100, 3). 


Some libraries assume the data is in front-to-back form, and some 
assume it’s in left-to-right form. If we don’t match their assumptions, 
things can go very wrong. For example, if our library expects our data 
to be in the front-to-back order of Figure 23.1(a), but we’re storing it 
in left-to-right order of Figure 23.1(b), the library will think that we 
have 200 images, each 100 pixels high by 3 pixels wide. This will not 
give us the results we want! 
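
The two layouts, and the misreading just described, can be sketched with NumPy. The variable names are ours, and the image is just a block of zeros standing in for real pixel data.

```python
import numpy as np

height, width, channels = 200, 100, 3

# channels_last: (rows, columns, channels), as in Figure 23.1(b)
image_channels_last = np.zeros((height, width, channels))

# Move the channel axis to the front to get channels_first,
# (channels, rows, columns), as in Figure 23.1(a)
image_channels_first = np.transpose(image_channels_last, (2, 0, 1))

print(image_channels_last.shape)    # (200, 100, 3)
print(image_channels_first.shape)   # (3, 200, 100)

# A routine that assumes channels_first reads the first axis as the
# number of slices, so given the channels_last block it "sees" 200 of them.
assumed_slices = image_channels_last.shape[0]
print(assumed_slices)               # 200
```

Both blocks hold the same 60,000 numbers. Only the order of the axes differs, which is why the library cannot detect the mismatch on its own.
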


Keras hides these library-dependent preferences from us, and restruc- 
tures the data as needed to make everything work. But we need to tell 
it which of these two approaches we're using. We do this by telling it 
whether our number of channels (in this case, 3) is the first dimension 
or the last when describing the block’s size. 


We usually provide this information in the Keras configuration file we 
mentioned above. In this text file, we identify how we’re organizing our 
data by setting the parameter named 'image_data_format' to either 
the string 'channels_first' or 'channels_last'. 


It’s always a good idea to make a backup of the configuration file before 
editing it. The file is plain text, so we can then open it with our favor- 
ite text editor and assign values to its variables, following the existing 
layout of the file. The values for most of the parameters will be strings 
that are named in quotes, and we need to preserve that. 


As an example, Listing 23.1 shows a typical Keras configuration 
file. Note the opening and closing curly braces. Here we're setting 
'image_data_format' to 'channels_last', telling the system that our 
data is structured with the channels last. We’re also setting 'backend' 
to 'tensorflow', telling Keras that we want to use TensorFlow as our 
library (or backend). The other two options are untouched. These are 
the options we'll be using in this chapter and the next. 


{ 
    "epsilon": 1e-07, 
    "backend": "tensorflow", 
    "floatx": "float32", 
    "image_data_format": "channels_last" 
} 


Listing 23.1: A typical Keras configuration file. We've set the backend 
and image_data_format parameters. The other two are untouched. 
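
Before saving an edited configuration, it can be reassuring to check that it still parses. The sketch below assumes the file is stored as JSON, which requires double quotes around names and string values. It parses the text of a configuration like Listing 23.1 with Python's standard json module rather than touching the real file, whose location varies from one installation to another. The variable names are ours.

```python
import json

# The text of a configuration like Listing 23.1, written as strict JSON.
config_text = """
{
    "epsilon": 1e-07,
    "backend": "tensorflow",
    "floatx": "float32",
    "image_data_format": "channels_last"
}
"""

config = json.loads(config_text)
print(config["backend"])
print(config["image_data_format"])

# Refuse to go on if the image layout is not one of the two legal names.
legal_formats = ("channels_first", "channels_last")
if config["image_data_format"] not in legal_formats:
    raise ValueError("image_data_format must be one of " + str(legal_formats))
```

If a quote or comma goes missing during editing, json.loads raises an error that points at the damaged spot, which is easier to diagnose than a failure deep inside the library.
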


When we import Keras into our Python code, Keras will read this con- 
figuration file. Then when we train our network with image data saved 
as we just specified, Keras will automatically restructure that tensor, 
if necessary, to match the expectations of whichever backend we’ve 
chosen. 


If we’re not working with images, then the setting of 
'image_data_format' is irrelevant. 


There are two other entries in the configuration file that we haven’t 
addressed. The parameter 'epsilon' is used to control numerical 
calculations. Its default has been carefully chosen to match the 
system’s internal algorithms, and in normal use of the library it 
should not be changed. 


The variable 'floatx' tells the system what type of floating-point 
number it should expect the data to be stored in. This value is also 
rarely changed. 


We can also read and write the values of these variables (except for 
'backend') from our code. This way we can change them for a given 
program without modifying our configuration file. To access these 
values, we use import to bring in the Keras module backend, and then 
call one of the functions in Listing 23.2. Changing these defaults 
should be done before calling any other Keras routines. The convention 
is to make these calls immediately, or very soon, after the import 
statements at the start of a file. 


from keras import backend as keras_backend 

# read the values of epsilon, floatx, and image_data_format 
ep_value = keras_backend.epsilon() 
floatx_value = keras_backend.floatx() 
idf = keras_backend.image_data_format() 

# set the values of epsilon, floatx, and image_data_format 
keras_backend.set_epsilon(0.0000001)   # rarely done 
keras_backend.set_floatx('float32')    # rarely done 

# the important one 
keras_backend.set_image_data_format('channels_last') 


Listing 23.2: How to set Keras configuration values from code. Note 
that we cannot set the backend choice from code. Setting the values 
for 'epsilon' or 'floatx' is unusual, and should only be done by an 
expert. 


Note that the first line in Listing 23.2 is an import statement that 
brings in the necessary module from Keras. If we forget this line, we'll 
probably get a NameError from Python when it runs this code. 


23.3.5 GPUs and Other Accelerators 


Many computers today come with a Graphics Processing Unit, or 
GPU. As the name suggests, these devices were originally designed to 
speed up the processing of 3D graphics typically used by games, scien- 
tific visualization, and other 3D-intensive applications. To accomplish 
this, the chips were designed to implement the mathematical steps 
commonly used to create these images. GPUs quickly became increas- 
ingly powerful, plentiful, and cheap. 


Surprisingly, machine-learning researchers realized that the 
feed-forward and backprop algorithms could be written in such a way 
that their mathematics looked a lot like the math that these chips 
were able to do so quickly, and in parallel. That is, not only could 
each calculation be performed faster than on a “normal” CPU, the chip 
could also do dozens or more of these calculations simultaneously. 


The speed boost provided by using GPUs, particularly during training, 
had an enormous effect. Models that would have been impractical to 
train on a regular CPU were suddenly within reach. 
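
To see why the math fits so well, here is a sketch of the weighted sums computed by one layer, written first as nested loops and then as a single matrix multiplication. It runs on the CPU with NumPy and is only meant to show the shape of the arithmetic. The sizes and names are invented for this illustration, and the bias and activation function are left out.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
samples = rng.normal(size=(8, 5))    # 8 samples, 5 features each
weights = rng.normal(size=(5, 3))    # a layer of 3 neurons, 5 weights each

# One neuron and one sample at a time: many small, independent sums.
outputs_loop = np.zeros((8, 3))
for sample_index in range(8):
    for neuron_index in range(3):
        outputs_loop[sample_index, neuron_index] = np.sum(
            samples[sample_index] * weights[:, neuron_index])

# The same arithmetic as a single matrix multiplication, the kind of
# operation that GPUs are built to carry out in parallel.
outputs_matrix = samples @ weights

print(np.allclose(outputs_loop, outputs_matrix))
```

None of the 24 sums in the loop depends on any other, so they can all be computed at the same time. That independence is what a GPU exploits.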


But not all GPUs are the same. Different manufacturers design GPUs 
with different features and technologies. NVIDIA has put a lot of 
explicit support for machine learning into its chips, and offers a 
great deal of support software, much of which is known collectively as 
CUDA [NVIDIA17]. As a result, most machine-learning libraries have 
targeted GPUs made by that company. 


To provide an alternative, an open-source project called OpenCL is 
dedicated to producing a library that will enable authors to write GPU 
programs in such a way that they will run on chips made by any 
manufacturer [Khronos17]. As of early 2018, the project is still being 
developed, but some bits and pieces of different libraries can now 
make use of any GPU. This is a fluid situation that is changing fast. 
The most up-to-date information can be found online in blogs and 
discussion boards. 


A newer alternative is the tensor processing unit, or TPU [Sato17]. 
This is a specialized chip designed for the kind of tensor processing 
needed by machine learning, and may be used instead of a GPU. As of 
early 2018, TPUs are rare on consumer-level hardware. 


23.4 Getting Started 


The Keras documentation, while complete, can also be challenging. 
Much of it is written for experts. For example, the documentation will 
identify the options that are available for a given routine, but it might 
not describe what those options mean, the pros and cons of each, nor 
what criteria we should use for choosing one. 


We can often fill the gap with online tutorials and examples. In extreme 
cases, we can dive into the publicly-accessible source code and, in the- 
ory, work out exactly what every option does. To avoid that kind of 
internet microscopy and source-code spelunking, in this chapter we 
will motivate and explain all of our variable settings and choices. 


Many Keras functions take optional arguments, some of which are 
broadly useful, while others are for very specific circumstances. To 
keep the discussion focused, we'll only talk about the functions and 
arguments that we use in this chapter. 


Our first trek to a trained neural network will take us over three 
mountaintops, the last of which is the final peak where we reach our 
goal of a running network. We'll reach the first mountaintop when 
we’ve seen how to pre-process our data to make it ready for learning. 
We'll summit the second when our network is built and ready to run. 
When we reach the third, we'll have seen how to run the network so it 
learns from our data. At this final peak we'll have put it all 
together, taking us from a blank slate to a trained network that can 
make predictions on new data. 


Let’s climb! 


23.4.1 Hello, World 


The first program in the first book on programming in the C lan- 
guage demonstrated how to get the computer to print “hello, world” 
[Kernighan78]. Since then, printing “hello, world” has been used as 
the first program by innumerable books covering countless languages. 
The phrase “hello world program” has come to mean the first thing we 
learn in almost any programming language or computer system, even 
if it’s not literally to print that phrase. 


Machine learning has two “hello, world” examples that just about 
everyone starts with: the iris dataset and the MNIST dataset. 
They're both categorization problems, based on small, free data sets. 
Because they’re so popular, libraries such as Keras and scikit-learn 
include special-purpose routines that let us read their data into our 
programs with just a single line of code. 


The iris dataset is a collection of information about 150 different iris 
flowers belonging to 3 different species [Wikipedia17]. Each sample 
contains 4 measurements, or features: the length and width of the 
flower's sepals, and the length and width of its petals. Our job is to 
learn from this labeled data how to take in the 4 measurements of a 
new flower and predict which of the 3 types it belongs to. Listing 
23.3 shows the first few rows of this data. 


5.1, 3.5, 1.4, 0.2, Iris-setosa 
4.9, 3.0, 1.4, 0.2, Iris-setosa 
4.7, 3.2, 1.3, 0.2, Iris-setosa 
4.6, 3.1, 1.5, 0.2, Iris-setosa 
5.0, 3.6, 1.4, 0.2, Iris-setosa 


Listing 23.3: The first few rows of the classic iris dataset. Each row holds 
the sepal length and width, petal length and width, and the name of the 
class that flower belongs to. We added some spaces for clarity. 
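
Rows in this form are easy to split into features and labels. The sketch below uses Python's standard csv module on the five rows of Listing 23.3, typed in as a string so the example is self-contained. The variable names are ours. In practice we would read the full file, or use a library's ready-made loader.

```python
import csv
import io

iris_text = """5.1, 3.5, 1.4, 0.2, Iris-setosa
4.9, 3.0, 1.4, 0.2, Iris-setosa
4.7, 3.2, 1.3, 0.2, Iris-setosa
4.6, 3.1, 1.5, 0.2, Iris-setosa
5.0, 3.6, 1.4, 0.2, Iris-setosa
"""

features = []
labels = []
# skipinitialspace drops the space that follows each comma
for row in csv.reader(io.StringIO(iris_text), skipinitialspace=True):
    # The first 4 entries are the measurements, the last is the class name.
    features.append([float(value) for value in row[:4]])
    labels.append(row[4])

print(features[0])
print(labels[0])
print(len(features))
```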


We've seen the MNIST dataset in previous chapters. This is a big col- 
lection of tiny grayscale scans (28 by 28 pixels) of hand-written digits 
from 0 to 9 [LeCun13]. The database is separated into 60,000 images 
for training, and 10,000 for testing. Each image is accompanied by an 
integer from 0 to 9 that serves as its label, telling us what digit the 
image contains. 


The drawings are diverse, with half coming from high school students, 
and half from employees at the US Census Bureau. The name MNIST 
stands for “modified NIST,” where NIST is the US National Institute of 
Standards and Technology, the source of the original data. The 
modifications involved pre-processing such as cropping and scaling the 
images. An interesting quality of these images is that some are 
ambiguous, even to human observers. Figure 23.2 shows 10 randomly 
selected examples of each digit, chosen from the training data. 


Figure 23.2: A random selection of images in the MNIST training set, 
organized by label. Notice the variation in thickness and style. A few 
details are worth noticing. The second 3 from the left almost disappears 
in places. The fourth 4 could be mistaken for a 9. The third 5 from the 
right could be called a 6 with an open loop. The rightmost 7 has a hori- 
zontal slash, which the other 7’s do not share. The upper loop of several 
of the 8’s is not closed. And the leftmost 9 has some extra artifacts. 


Because the iris and MNIST datasets are the equivalent of “hello, world” 
for machine learning, they appear in almost every book and tutorial on 
the subject. This has both pros and cons. 


The pros are substantial. One important advantage of using either of 
these well-known databases is that because so many people have stud- 
ied them, they’re known to be good test databases. 


Another advantage of both data sets is that because they’re very widely 
known, it’s easy to find a variety of networks that people have already 
built and trained. The UCI Machine Learning Repository, which hosts 
the Iris flower dataset, calls it “...perhaps the best known database to 
be found in the pattern recognition literature” [UCI16]. The MNIST 
data is not far behind. Tables of scores for MNIST (and many other 
standard databases) are online, along with the architectures of the net- 
works, so we can study and learn from them [Benenson16] [LeCun13]. 


Another advantage of these datasets is that they have proven them- 
selves to be excellent for developing skills in machine learning. They’re 
small enough that our programs will run quickly, and they describe 
concrete, understandable phenomena. The datasets themselves are 
clean, meaning that they’re free of typos, errors, and other details that 
can interfere with the learning process, for both humans and comput- 
ers. And the MNIST database, with a total of 70,000 samples, is big 
enough to do some real training and experimenting. 


The main downside of using these datasets is precisely that because 
they are so well-known, their use can become repetitive. 


On balance, we feel that the risk of over-familiarity is worth the bene- 
fits of using such well-understood and useful datasets. For consistency 
we'll choose the MNIST dataset for our examples in this chapter. 


Another substantial positive quality of the MNIST data is that we can 
draw pictures of it. Abstract data is great, but it can be challenging 
to interpret. Images are great because we can evaluate many things 
about them just by looking. 


23.5 Preparing the Data 


This section’s notebook is 
Keras-Notebook-01-Preparing-the-Data.ipynb 


According to the creators of MNIST, 


The original black and white (bilevel) images from NIST were 
size normalized to fit in a 20x20 pixel box while preserving their 
aspect ratio. The resulting images contain grey levels as a result 
of the anti-aliasing technique used by the normalization algo- 
rithm. The images were centered in a 28x28 image... [LeCun13]. 


So we know that the digits are all centered, the gray values range from 
black to white in each image, and looking at the data it’s clear that 
they’ve all been scanned in so the digits are mostly upright. All of this 
makes our lives easier. 


With most databases, we’d have to do this kind of pre-processing work 
ourselves to make our samples consistent and comparable with each 
other. We’d also have to weed out bad scans, correct mislabeled dig- 
its, and otherwise check and recheck (and recheck!) our database to 
make sure it was both complete and accurate. When all of this has 
been done, we say the database is clean. Cleaning a database can take 
a huge amount of time and effort, and another big advantage of using 
the MNIST data is that a lot of the cleaning has already been done. 


We're going to go through the remaining pre-processing of the MNIST 
data slowly and carefully, one step at a time. We'll use tools from both 
Keras and scikit-learn. This is both to carefully demonstrate what we’re 
doing, and to show the sort of thinking we go through when we think 
about pre-processing. 


Our goal is not just to pre-process the MNIST data, but to present the 
flow of the process, so we can apply it to new databases in the future. 


It’s always important to get a good feeling for our data before we start 
to work with it. Visualization, statistics, and even direct examination of 
the data files can give us insights into the character of our data. These 
insights are always useful when we think about how to process and 
learn from our data. 


23.5.1 Reshaping 


In this chapter we’re going to reshape our data several times. Rather 
than roll along for a while and then stop to discuss this operation, we'll 
cover it now so it will be familiar when we need it. 


Reshaping can be a mysterious process for programmers who haven’t 
worked with multidimensional arrays (or tensors), so here’s a short 
overview of what’s going on. Readers familiar with multidimensional 
arrays and reshaping them should at least skim this section, because it 
contains the conventions we'll be using to draw and refer to our data. 
We also introduce a few useful features of NumPy along the way. 


Reshaping is a general programming idea, so the ideas covered here 
are applicable to any programming language or task, not just Python 
or machine learning. 


We'll start by imagining a list of 12 objects, which we'll name with 
labels A through L. Figure 23.3 shows these items. 


 0  1  2  3  4  5  6  7  8  9 10 11 
 A  B  C  D  E  F  G  H  I  J  K  L 


Figure 23.3: We have 12 items arranged in a one-dimensional list. Each 
element in the list is made up of just a single letter. Each element requires 
only a single index from 0 to 11 to identify it. 


We call this a one-dimensional list, or simply a list, because we 
need only one dimension, or index, to identify which element we want. 
In a 1D list our convention will be to start at the left and count to 
the right. We always count indices starting with 0 [Dijkstra82]. 


So the cell at index 1 contains the label “B,” and the label “H” is in the 
cell with index 7. 
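
In Python this looks as follows. It is a minimal sketch using only the standard library, and the variable name is ours.

```python
import string

# The 12 labels "A" through "L", stored in a plain Python list.
labels = list(string.ascii_uppercase[:12])

print(len(labels))   # 12 elements, with indices 0 through 11
print(labels[0])     # the first element has index 0
print(labels[1])     # B
print(labels[7])     # H
```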


Here’s the key point we’re going to see in this section: we can tell the
computer to think of this data as arranged in different ways, but we
never change this list. No matter how we reshape it, the underlying
data stays in a one-dimensional list and isn’t affected. By reshaping
the data, all we’re doing is telling the computer how to interpret the
data when we read or write it. The data itself is not touched. (As always,
there are exceptions to this generalization, particularly when efficiency
measures are applied, but those are usually invisible to us as users of a
library.)
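To make this idea concrete, here is a minimal sketch using NumPy. The variable names letters and grid are our own, chosen just for this illustration. For a simple array like this one, NumPy hands back a new way of looking at the same memory rather than a copy, though as noted above this is not guaranteed in every situation.

```python
import numpy as np

# Twelve single-letter labels, A through L, stored as a 1D array.
letters = np.array(list("ABCDEFGHIJKL"))

# Ask NumPy to interpret the same 12 items as 3 rows by 4 columns.
grid = letters.reshape(3, 4)

# For a simple, contiguous array like this one, both names refer to
# the same underlying memory.
same_memory = np.shares_memory(letters, grid)
print(same_memory)

# Writing through one name is visible through the other, because
# there is only one copy of the data.
grid[0, 0] = "Z"
print(letters[0])
```

The second print shows the changed label, which tells us that reshaping gave us a new interpretation of the data and not a second copy of it.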


NumPy offers a convenient routine that lets us reshape any input 
data into many different forms. For example, we can make a 2D grid 
that is 3 rows down by 4 columns across, as in Figure 23.4. 





Figure 23.4: Our one-dimensional list of Figure 23.3 re-shaped into a 
2D list of 3 rows and 4 columns. Each entry now requires two indices to 
identify it, in the order down and then right. We place these indices in 
parentheses, separated by a comma. Starting in the upper left, we work 
our way right, then go down one row and start again from the left. 


We call this a two-dimensional list, or a grid, because we require
two numbers to identify each element.


There’s a possibility of confusion here that’s worth addressing. In
Figure 23.3 each element is drawn in a little box, and we called a
horizontal row of those boxes a 1D list. But in Figure 23.4 we also drew
our elements in little boxes, and called that arrangement a 2D grid.
Couldn’t we interpret Figure 23.3 as a 2D grid also, one that’s 12 elements
wide by 1 element high?


We definitely can, and we sometimes will. This is the source of the 
potential confusion we just mentioned: we can’t tell just by looking at 
Figure 23.3 if it’s a 1D array, or a 2D array of 1 row and 12 columns. 
We'll have the same problem later with 2D grids seen from the side, 
which might look like just the nearest slice of a 3D volume. 


As humans looking at pictures, it’s usually not a problem if we
interpret a row of boxes like Figure 23.3 as a 1D list, or as a 2D grid with
1 row. But when we're programming, the distinction is critical. Most
library routines are strict about their parameters, and they'll complain
or even crash if they get passed a variable with the wrong number of
dimensions. If a routine expects a two-dimensional input, then it had
better get a two-dimensional input, even if, to us humans, it’s just a list
of numbers.
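A small sketch shows how the computer tells these two cases apart, even though they would look the same in a drawing. The names row_1d and row_2d are hypothetical, picked only for this example.

```python
import numpy as np

row_1d = np.arange(12)           # a 1D list of 12 items
row_2d = row_1d.reshape(1, 12)   # a 2D grid of 1 row and 12 columns

# ndim is the number of dimensions, and shape is the size along each one.
print(row_1d.ndim, row_1d.shape)
print(row_2d.ndim, row_2d.shape)
```

Both arrays hold the same 12 numbers in the same order, but a routine that checks the number of dimensions will treat them differently.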


When we get to the programming examples, we'll be careful to keep 
track of the number of dimensions in our data structures. In any dis- 
cussion where the difference is important, we'll always be clear about 
how many dimensions make up any particular tensor. 


Returning to our 2D grid of Figure 23.4, in such a grid our convention
is to use the first index to count down, and the second to count to
the right. In brief, we index a 2D array as (down, right).


This ordering is completely for our convenience. The computer cares
about how the data is arranged, but it doesn’t care how we picture the
data’s arrangement when we make diagrams for ourselves. But since
we'd like to be able to draw pictures of our data, like Figure 23.4, and
we want them to mean the same thing to everyone, we use the
convention of listing the indices as down and then right.


Our (down, right) convention is popular, but not universal. We'll
sometimes find pictures in documentation or other publications that
interpret the data in some other order. It always pays to check.


Another convention is that we fill up the cells by starting at (0,0), then 
increment the rightmost index to get (0,1), then (0,2), and so on, until 
we reach the end of the row. Then we set the rightmost index back to 
zero and increment the index to its left, putting us at (1,0). We then 
continue to the right, with cells (1,1), then (1,2), and so on. 


Using the down-then-over convention, we say that the layout of Figure 
23.4 is arranged 3 by 4, meaning there are 3 rows and 4 columns. The 
cell at index (1,2) contains the label “G,” and the label “J” is in the cell 
with index (2,1). 
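We can check these two lookups with a short sketch. It assumes we store the labels A through L in a NumPy array (the names letters and grid are ours) and reshape it to 3 by 4, which fills the cells in the left-to-right, top-down order described above.

```python
import numpy as np

letters = np.array(list("ABCDEFGHIJKL"))
grid = letters.reshape(3, 4)   # 3 rows down, 4 columns across

print(grid)

# Index in the order (down, right).
down1_right2 = grid[1, 2]
down2_right1 = grid[2, 1]
print(down1_right2, down2_right1)
```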


There are many other ways to arrange the 12 elements of our list into a 
2D box. Continuing to use our convention of filling up the boxes left to 
right, then top down, Figure 23.5 shows a few other possibilities. 





Figure 23.5: Three more ways to arrange our 12 items into a 2D list. From 
left to right, these grids have dimensions 4 by 3, 2 by 6, and 6 by 2. 


We can even reshape our data into 3D. As in 2D, there is no universal
convention for drawing data in 3D. Recall that in 1D, our one index
told us how far to the right to move. When we needed a convention for
2D, we put “down” in front of the 1D “right”. For 3D, we'll put “away”
in front of the 2D “(down, right)”, giving us the order (away, down,
right). We start in the near, upper-left corner.


This has a nice analogy to reading a book. To identify a particular letter, 
we'd specify the page (away), the line of text (down), and the letter’s 
position in the line (right). 


Figure 23.6 shows this visually. 











Figure 23.6: Our convention for identifying cells in a 3D block will be 
to start in the near, upper-left corner. We name cells in the order (away, 
down, right). (a) The three directions in sequence. (b) Finding a cell in a 3D 
volume. The first index tells us how far to move away, the second index 
how far to move down, and the third index how far to move right. 


This fits nicely with our 2D convention as above. We think of our block 
as a collection of vertical slices arranged front to back. Each vertical 
slice is indexed in the order down and then right, just as in our 2D 
arrays above. In terms of our two arrangements in Figure 23.1 above, 
this is the channels_last organization. 


A 3D block with indices is shown in Figure 23.7.


Figure 23.7: Identifying each cell in a 3 by 3 by 3 cube. Each cell requires
3 numbers to identify it. (a) Using our convention of Figure 23.6 we count
away, down, and right. The rightmost index changes the fastest, then the
middle index, and finally the left-most index. (b) Filling in the letters A-Z
in order. Since there are 27 cells and only 26 letters, we placed a star at
the end of the alphabet, in cell (2,2,2).


The nine cells of the closest vertical slice are all indexed by their usual
(down, right) values, with an “away” value of 0. The vertical slice in
the middle has the same indices, but an “away” value of 1. And the
farthest slice has an “away” value of 2.
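Here is a sketch of the same 27-cell cube in NumPy, using the letters A through Z followed by a star. The names cells and cube are our own. Indexing with a single number picks out one whole vertical slice, and three numbers pick out one cell in the order (away, down, right).

```python
import numpy as np

# 26 letters plus a star give us the 27 labels we need.
cells = np.array(list("ABCDEFGHIJKLMNOPQRSTUVWXYZ*"))
cube = cells.reshape(3, 3, 3)

nearest_slice = cube[0]       # "away" value of 0: a 3 by 3 grid
print(nearest_slice)

first_of_middle = cube[1, 0, 0]   # first cell of the middle slice
last_cell = cube[2, 2, 2]         # far, bottom, right corner
print(first_of_middle, last_cell)
```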


Figure 23.8 shows three different ways to organize 12 entries into
blocks.


Figure 23.8: Three ways to organize our 12 elements into 3D blocks. From
left to right, these have dimensions 2 by 3 by 2, 2 by 2 by 3, and 2 by 1 by 6.


We can change our arrangement of 12 items to any shape in Figure 23.8, 
and do so repeatedly, but remember that this operation only changes 
how the computer refers to the information. We never change the data 
itself. In other words, the computer does not move data around when 
we tell it to reshape it to some other shape. Reshaping simply tells the 
computer how we're going to name the elements: how many dimen- 
sions we'll use, and what values each dimension can take. It just saves 
those numbers, and then uses them when we actually read or write the 
data. So re-shaping a list of 12 elements is no faster than re-shaping a 
list of 12 million elements. The computer just remembers how many 
dimensions we have, and how big each one is, so it can locate the one 
we want when we provide a set of indices. 


This principle is vital because it means we can repeatedly re-shape 
the data for different purposes, and it will always stay in order. So for 
example we can take our MNIST training samples, which arrive as a 
3D box, and flatten them out, and then re-shape them in a 4D struc- 
ture, and the data is never altered by these steps. In fact, we’ll do just 
these sorts of things in the code below. 
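As a small rehearsal of that sequence, the sketch below uses a made-up stand-in for image data (5 "images" of 28 by 28 numbers, not the real MNIST samples). The variable names are hypothetical. We flatten the 3D box, reshape it into a 4D structure, and then confirm that the values are still in their original order.

```python
import numpy as np

# Stand-in data: 5 tiny "images", each 28 by 28. Not the real MNIST set.
images = np.arange(5 * 28 * 28).reshape(5, 28, 28)

flat = images.reshape(5, 28 * 28)     # one row of 784 values per image
four_d = flat.reshape(5, 28, 28, 1)   # a 4D structure with a final size of 1
restored = four_d.reshape(5, 28, 28)  # back to the 3D box we started with

print(flat.shape, four_d.shape, restored.shape)

# The order of the underlying values never changed along the way.
unchanged = np.array_equal(restored, images)
print(unchanged)
```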


We just referred to a 4D data structure, meaning that we'll access our 
elements with 4 numbers. That’s not easy to draw. 


There’s a nice way to visualize these multidimensional lists that
works for any number of dimensions.


We think of our data structure as a list of lists. Instead of arranging
our data spatially, as in the above figures, we draw the 1D list that
represents the data in the computer’s memory, and place the various
pieces into a hierarchy of simple 1D lists, where each list is nested
inside another.


In a 2D grid, there are 2 levels of nesting (each row is a list of elements,
and the whole grid is a list of rows). In a 3D block, there are 3 levels
(each row contains elements, each vertical slice contains rows, and
the whole block is a list of slices).


For example, recall the 3 by 4 grid in Figure 23.4. We can think of this
as a grid of 3 rows of 4 items each, or as a list of 3 lists of 4 elements
each, as in Figure 23.9.


Figure 23.9: The 2D grid of Figure 23.4 has three rows of four elements
each. We can show this as a hierarchy of 1D lists. Each set of four elements
(that is, a row) is in a list. To identify any element, we first choose the list
we want (the row), and then the element we want from that list (the
column).


To find the element at cell (1,2) we go to list 1 (that’s the second list, 
since we start counting at zero) and then select the third element. So 
element (1,2) is “G.” We don’t refer explicitly to the outermost list, 
since that’s just a wrapper to keep everything together. 


In the same way, we can nest our lists another level and represent the
3D blocks of Figure 23.8 as a set of nested lists. Figure 23.10 shows
how this would look for the leftmost block of dimensions 2 by 3 by 2.


Figure 23.10: The 12 elements of our list arranged in the 3D block of size
2 by 3 by 2, as in Figure 23.8. To identify any cell, we need three numbers,
corresponding to the indices in each of the nested lists.


We read the indices in the same way as before, starting with the
outermost list and working inwards.


The element at index (1,0,1) is in the second list inside the outermost
list, then the first list inside that, and then the second element of that
list, giving us the label “H.”
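The list-of-lists picture can be tried out directly. In this sketch (the names are ours), we ask NumPy to convert a reshaped array into ordinary nested Python lists with tolist(), and then pick out elements one level at a time, from the outermost list inwards.

```python
import numpy as np

letters = np.array(list("ABCDEFGHIJKL"))

# The 3 by 4 grid as a list of 3 lists of 4 elements each.
grid_lists = letters.reshape(3, 4).tolist()
print(grid_lists)
cell_1_2 = grid_lists[1][2]        # list 1, then element 2

# The 2 by 3 by 2 block as three levels of nested lists.
block_lists = letters.reshape(2, 3, 2).tolist()
print(block_lists)
cell_1_0_1 = block_lists[1][0][1]  # list 1, then list 0, then element 1

print(cell_1_2, cell_1_0_1)
```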


This offers another way to see that the data itself is never touched. The
list-of-lists approach for the other blocks in Figure 23.8 is shown in
Figure 23.11. We can see that the data is still just a simple,
one-dimensional list of cells in order, and our reshaping simply tells the
computer to group them together in different ways.


Figure 23.11: Interpreting the middle and right blocks of Figure 23.8 as
lists-of-lists. Top: The block is 2 by 2 by 3. Bottom: The block is 2 by 1 by 6.


Note that in all of our examples, all of the lists at each layer have the
same length. This is just another way of saying that our structures have
no holes or extra bits sticking out in any dimension.


We reshape data using the NumPy function reshape(). As with many
NumPy functions, we can call this in two different ways. Let’s suppose
we have data in a variable called demoData, arranged in a 2D grid like
we saw above, with 3 rows and 4 columns. We’d like to rearrange
this as a grid of 6 rows and 2 columns. We communicate the new shape
we want by handing reshape() a list (or tuple) containing the new size
along each dimension. For this example, we’d give it (6, 2). We can
assign the result back to demoData if we like, but let’s save it in a new
variable called newData.


If demoData is not a NumPy array, we need to call reshape() from the 
NumPy library. We give it the array we want it to reshape, and the list 
of the new dimensions. This is shown in Listing 23.4. 


import numpy as np
demoData = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
newData = np.reshape(demoData, (6, 2))
print(newData)


[[ 1  2]
 [ 3  4]
 [ 5  6]
 [ 7  8]
 [ 9 10]
 [11 12]]


Listing 23.4: Reshaping the array demoData by calling reshape()
directly from NumPy.


If demoData is a NumPy array, then we can call reshape() as a method
of the array itself. To turn a Python list into a NumPy array, we can
call NumPy’s array() function. This will work for a list of any shape.
That is, the input can be a tensor with any number of dimensions, and
the output will be a NumPy array (or tensor) of the same shape. This
version of reshaping is shown in Listing 23.5.


demoData = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
newData = demoData.reshape((6, 2))
print(newData)


[[ 1  2]
 [ 3  4]
 [ 5  6]
 [ 7  8]
 [ 9 10]
 [11 12]]


Listing 23.5: Reshaping the array demoData by calling reshape() as a
method of the array.


The only rule is that the total number of elements in the tensor can’t
change. That is, if we multiply together all of the dimensions in the
original shape of the tensor (here, 3 by 4), we must get the same value
as when we multiply together all of the dimensions in the new shape
(here, 6 by 2). Since 3x4=12 and 6x2=12, our examples worked.
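We can test this rule ourselves before asking for a reshape. The sketch below is just one way to do the arithmetic: it compares the array's size attribute (its total number of elements) with the product of the sizes in each candidate shape. The list of candidate shapes and the dictionary named fits are made up for this example.

```python
import numpy as np

demoData = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

# Candidate shapes to try. The last one needs 75 elements, so it can't work.
candidates = [(6, 2), (2, 6), (2, 3, 2), (5, 15)]

fits = {}
for new_shape in candidates:
    # A reshape is possible only if the element counts agree.
    fits[new_shape] = bool(demoData.size == np.prod(new_shape))
    print(new_shape, fits[new_shape])
```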


If we try to reshape our data to an incompatible size, Python will com- 
plain. For example, Listing 23.6 shows the output from the interpreter 
when we try to reshape our 12-element array demoData to the shape 
(5,15). Since we don’t have 5x15=75 elements, Python reports an 
error. 


demoData = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
demoData.reshape((5,15))


ValueError                                Traceback (most recent call last)
<ipython-input-5-a51a5832a9f8> in <module>()
      1 demoData = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
----> 2 demoData.reshape((5,15))

ValueError: cannot reshape array of size 12 into shape (5,15)


Listing 23.6: Reshaping the array demoData to an incompatible size 
causes an error. 


We'll make use of reshape() quite a bit. 


Chapter 23: Keras Part 1 


23.5.2 Loading the Data 


Now that we have re-shaping under our belts, let’s return to our main 
goal of getting a neural network up and running. We'll begin by getting 
our hands on the data, and then prepare it for training. 


Listing 23.7 shows how easy it is to load the MNIST set, since it’s
provided with Keras. To get it, we import the mnist module and then use
its custom load_data() function to get the data. This returns two tuples:
the training data and the test data. Each tuple in turn contains two
arrays, holding the features (that is, the images), and the labels. We can
use Python’s convenient assignment mechanism to assign all four arrays to
our own variables with just one statement.


from keras.datasets import mnist
(samples_train, labels_train), (samples_test, labels_test) = \
    mnist.load_data()


Listing 23.7: Load MNIST data. It will be downloaded automatically if 
needed. 


This is a good moment to point out that Keras functions (and their 
arguments) are pretty consistent about naming which kind of data set 
various objects belong to. Training data usually has the word train 
in there somewhere, test data has the word test, and validation data 
usually has the word val somewhere in its name. 


As we saw in Chapter 8, when we use a technique like cross-validation
we break down our input data into the training set, the validation
set, and the test set. We teach many variations of the system using
the training set, and then after each training we evaluate the
performance with the validation set. When we’re done searching, we select
the model we want to deploy and measure its performance with the
test set. So the training and validation sets are used over and over, and
the test set is used only once.


When we're not cross-validating, we need only the training set and the
test set. The Keras documentation for mnist.load_data() identifies
the returned data as belonging to these two categories, as in Listing
23.8 [Chollet17a].


# Load MNIST using conventional names for returned objects 
(x_train, y_train), (x_test, y_test) = mnist.load_data() 


Listing 23.8: The routine mnist.load_data() returns a training set 
and a test set. 


If the MNIST data has not been previously downloaded to this com- 
puter, then when we first load it Keras will automatically fetch a 
compressed form from the web, decompress it, and then save it in 
the directory that Keras maintains for these types of downloads (the 
exact location of this directory for each type of operating system can 
be found in the Keras documentation). If we request this data again on 
this computer, Keras will automatically grab the data already saved on 
the disk, saving us lots of time. 


In Listing 23.7, the first pair of variables, samples_train and 
labels_train, holds arrays with the 60,000 images that form the 
training set, and their corresponding integer labels. The second pair 
of variables, samples_test and labels_test, holds arrays with the 
10,000 images and labels that make up the test set. 


Let’s get a quick look at their shapes by printing them out in
Listing 23.9. These arrays all come back to us from Keras already as
NumPy arrays, so they all have a built-in shape attribute we can print.


print('samples_train shape = ',samples_train.shape)
print('labels_train shape = ',labels_train.shape)
print('samples_test shape = ',samples_test.shape)
print('labels_test shape = ',labels_test.shape)


samples_train shape = (60000, 28, 28) 
labels_train shape = (60000,) 
samples_test shape = (10000, 28, 28) 
labels_test shape = (10000,) 


Listing 23.9: The shapes of the MNIST data from Listing 23.7. 


This is telling us that samples_train is a 3D block of 60,000 layers.
Each layer holds a 28 by 28 image. The labels_train variable is a 1D
list of 60,000 elements (we'll see that each is a number from 0 to 9).
The extra comma at the end of (60000,) is a Python convention to tell
us that this is a tuple holding the single value 60,000 (that is, the shape
of a list of 60,000 elements), and not just the number 60,000
surrounded by parentheses [Wentworth12]. Similarly, samples_test
is an array of 10,000 images, each 28 by 28, and labels_test is a list
of integers with the test data’s corresponding labels.
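The role of that trailing comma is easy to confirm in plain Python, with no libraries involved. The two variable names here are ours.

```python
# With the comma, Python builds a tuple that holds one value.
with_comma = (60000,)

# Without the comma, the parentheses just group a number.
without_comma = (60000)

print(type(with_comma).__name__, len(with_comma))
print(type(without_comma).__name__)
```

Since a shape is always reported as a tuple, a one-dimensional shape needs that comma to show up as a tuple at all.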


Although these variable names are perfectly fine, a common code con- 
vention is to use the capital letter X to refer to a data set’s samples, 
and a lower-case letter y to refer to its labels. We saw this in the docu- 
mentation snippet in Listing 23.8. These letters were chosen to match 
the letters used in many deep-learning equations. The carry-over was 
natural in early programs that were written to closely match the equa- 
tions, and the convention stuck. The lower-case x is also used for the 
samples, and the upper-case Y for the labels, though those are less 
common. 


Using this convention, we’d write Listing 23.7 more succinctly as 
Listing 23.10. 


(X_train, y_train), (X_test, y_test) = mnist.load_data() 


Listing 23.10: Loading MNIST data, using X for samples and y for labels.


Using X and y is a nice convention once we’re used to it, because these
single letters can save us a lot of typing, and they are quickly
understood by anyone who’s used to this naming scheme.


There’s no rule that says we have to use these cryptic variable names, 
even if they are conventional. The style of using X for features and y 
for labels is so frequently used that it’s probably a good thing in the 
long run to use it, and we'll do so here. But every programmer should 
follow their own instincts for writing code that is clear and useful to 
themselves and others. 


23.5.3 Looking at the Data 


The first step in using any database is to look at it. We want to make 
sure that it’s clean and organized in a useful way. We also want to gen- 
erally get a feeling for what we’re working with. 


If the data needs to be modified before we use it for learning, we can use
a combination of straight Python programming, and functions from
libraries such as NumPy, SciPy, scikit-learn, and Keras itself. Such
pre-processing is a vital step in making sure our network will work the
way we want, and in preventing errors. Happily, the MNIST dataset needs
only a little bit of this work, so we can present it all here to get a flavor
for the process.


There are at least two potential sources of problems to keep an eye 
out for. Content problems are numerical issues with the data itself, 
while structural problems are issues regarding how the data is 
organized. 


Let’s literally look at the data first. Figure 23.12 shows another
random sampling of images from the training data. We can see that the
examples are not all perfect.


Figure 23.12: A random sampling of the images from the MNIST training
set.


There are four standout issues.


First, some of the images bleed very close to the edge of the 28 by 28
box, rather than sitting inside the relatively thick black border of 4 pixels
all around that the original paper describes [LeCun13]. Some examples
from the training set that have this quality are shown in Figure
23.13.


4880 3442 11947 7195 4759 3382 2133 7192 2380


Figure 23.13: Some images from the MNIST training set that
demonstrate a bleeding of the image very near, or right up to, the border. The
number above each example shows its index in the training set.





Second, some of the digits appear to have had pieces cropped away, 
substantially changing their shape. Figure 23.14 shows a few examples. 


12184 3002 9363 2231 26447 28491 5052 55475


Figure 23.14: Some images from the MNIST training set that have been 
cropped, chopping away some of what seems very likely to have been 
drawn, and sometimes creating multiple, disconnected pieces. 





Third, some of the images are noisy. Sometimes this means that lines 
thin out or disappear. More often there are spurious regions of white, 
perhaps due to errors during cropping or thresholding. These don’t 
usually cause much confusion to human observers, but these artifacts 
have the potential to throw off a computerized network. Figure 23.15 
shows some examples. 


51323 51363 51459 53205 55539 26471 25159 7599 10677 58871 


Figure 23.15: Some images from the MNIST training set that demon- 
strate noise artifacts. Some of these might be due to thresholding or 
cropping errors. 





Finally, there are some examples that seem challenging to interpret,
either because of how they were drawn, or how they were processed.
Figure 23.16 shows a collection of some of these oddball training
examples.


50239 50856 16676 26398 26624 27514 29897


Figure 23.16: Some images from the MNIST training set that appear
particularly challenging to categorize.


We might be tempted to remove samples that have the artifacts we just 
looked at, but in fact as long as there aren’t too many of them, they can 
make our system stronger. If our network can correctly identify these 
images despite their imperfections, then it has a robust quality that it 
wouldn’t have without these stressful examples. 


After browsing several random chunks of the data, we concluded that 
these problems were infrequent enough that we wouldn’t bother to 
remove them. Even though we're taking no action, it was important to 
look the data over and reach this conclusion based on the data, rather 
than a hopeful guess. 


Now we'll turn to the structure of the data, and see how it’s organized. 


Our main interest is in the shapes of the variables that we got from
mnist.load_data(). Listing 23.11 recaps our starting objects using
the shorthand X for samples and y for labels.


print('X_train shape:', X_train.shape,
      'y_train shape:', y_train.shape)
print('X_test shape:', X_test.shape,
      'y_test shape:', y_test.shape)


X_train shape: (60000, 28, 28) y_train shape: (60000,)
X_test shape: (10000, 28, 28) y_test shape: (10000,)


Listing 23.11: Printing shape information about our input data. 


Our training data, X_train, is in a 3D block. Using our (away, down, 
right) convention, it’s 60,000 slices deep, where each vertical slice is 
28 by 28 units. Figure 23.17 shows this shape. 





Figure 23.17: Our training data, X_train, has shape 60000 by 28 by 28. 
That means it’s a stack of 60,000 objects, each an image that’s 28 by 28 
pixels. 


The test data is set up the same way, except the stack is only 10,000 
images deep. 


We're going to reshape our data in the following sections, so let’s stash
the original height and width of each image in a variable. We'll also
multiply them together and save that as the total number of pixels per
image. Listing 23.12 shows how we'll save this data.


image_height = X_train.shape[1]
image_width = X_train.shape[2]
number_of_pixels = image_height * image_width


Listing 23.12: Saving the sizes of our input data for later use. 


This is a bit of overkill, since for this fixed data set we know every image 
is 28 by 28, but this more general approach will make it easier to later 
copy this code and adapt it to a new data set. 


The labels are given to us as one-dimensional lists. The training label 
list y_train has, as expected, a length of 60,000, since it’s providing 
one label for each sample in the training set. Let’s look at the first few 
elements in Listing 23.13. 


print('start of y_train:', y_train[:15])

start of y_train: [5 0 4 1 9 2 1 3 1 4 3 5 3 6 1]


Listing 23.13: The first few elements of the labels in y_train. 


So each entry in y_train is an integer. We expect it to be the label of
the corresponding image in X_train. It always pays to check, so let’s
look at the first 15 images in X_train, shown in Figure 23.18.


5 0 4 1 9 2 1 3 1 4 3 5 3 6 1


Figure 23.18: The first 15 images in X_train. These match the labels in 
y_train, shown above each sample, so we're good. 


Great, the labels in y_train match the corresponding images in
X_train. Since the MNIST data is so well known we can stop here, but
with less familiar data sets we’d probably want to make at least several
of these spot checks throughout the data to make sure that the two
lists stay in sync.


Now let’s look at the data itself. In Listing 23.14 we print an arbitrary
little rectangle from within the first image of X_train. A handy bit of
Python to keep in mind is that by simply typing the name of a variable
into the interpreter (rather than using a print statement), we sometimes
get more information about the variable.


X_train[0, 5:12, 5:12]


array([[  0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,  30,  36,  94, 154],
       [  0,   0,  49, 238, 253, 253, 253],
       [  0,   0,  18, 219, 253, 253, 253],
       [  0,   0,   0,  80, 156, 107, 253],
       [  0,   0,   0,   0,  14,   1, 154],
       [  0,   0,   0,   0,   0,   0, 139]], dtype=uint8)


Listing 23.14: A small rectangle from the first training image in X_train. 


The variable dtype at the end tells us that this is a NumPy array, repre- 
sented by the data type uint8, which means an unsigned 8-bit integer. 
Checking X_test reveals the same structure. As we might expect from 
grayscale image data, all of the values are between 0 and 255 (more on 
that below). 
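When we don't want to read through printed numbers by eye, we can ask NumPy for the same facts directly. This sketch uses a randomly generated stand-in for a single 28 by 28 grayscale image (not a real MNIST sample), and the variable names are hypothetical. It reports the data type, the smallest and largest values actually present, and the full range that the data type is able to hold.

```python
import numpy as np

# Stand-in for one 28 by 28 grayscale image. Not real MNIST data.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(28, 28), dtype=np.uint8)

print(image.dtype)                 # the type used to store each pixel
print(image.min(), image.max())    # the values actually present

# iinfo describes the limits of an integer data type.
limits = np.iinfo(image.dtype)
print(limits.min, limits.max)
```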


Are the labels also NumPy arrays? Listing 23.15 shows a piece of the 
y_train array. 


y_train[:15]


array([5, 0, 4, 1, 9, 2, 1, 3, 1, 4, 3, 5, 3, 6, 1], dtype=uint8)


Listing 23.15: A piece of the y_train array. The input is on the first line, 
the rest is output. 


Yup, this is a 1D NumPy array of unsigned 8-bit integers. That’s pretty
restrictive for training labels, since such numbers can’t go above 255.
But here we're only storing the labels 0 to 9, so the range from 0 to 255
is plenty.


To use this data for training with Keras, we need to turn the training
and test sample data into normalized floating-point numbers, and turn
the labels into one-hot encodings (as discussed in Chapter 12).


But before we do that we'll take a quick pause. The MNIST data is con- 
veniently already split into training and test sets. What if it wasn’t? 
There’s a nice utility that will split our data for us. Let’s look at it now. 


23.5.4 Train-test Splitting 


Most data sets require us to manually split them into training and test 
sets. The MNIST data has already been split for us, but for complete- 
ness, let’s see how we'd do the job if we had to. 


The easiest and most common approach is to use scikit-learn’s 
train_test_split() function to do all the work for us. Suppose that 
the MNIST data came to us as only two tensors, called samples and 
labels, and we want to split it into a training set and a test set. A typi- 
cal test set is often around 20% or 30% of the starting data, so let’s go 
down the middle with 25%. 


We just call train_test_split() with our data and the split size, and 
it returns four arrays, as in Listing 23.16. 


from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = \
    train_test_split(samples, labels, test_size=0.25)


Listing 23.16: Splitting data into a training set and a test set using 
train_test_split() from scikit-learn. 


Figure 23.19 shows this operation visually.


Figure 23.19: Splitting a dataset of 60,000 images. We're using 25% of the
data for the test set and the other 75% for the training set. The function
train_test_split() doesn’t simply cut the input data in one place
as shown here, but shuffles a copy of the data first so it’s more likely that
each of these two pieces will contain a good mix of all the samples.


Note that train_test_split() gives us back four arrays, not the two
pairs of arrays that were returned by mnist.load_data().
They’re also in a slightly different order compared to Listing 23.10.
These kinds of minor inconsistencies between libraries can be a hassle
until we get used to them.
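To get a feeling for the shuffle-then-cut idea and the order of the four returned arrays, here is a rough sketch written with NumPy alone. The function shuffled_split() is a hypothetical helper of our own, not scikit-learn's code, and the real routine offers more options and may differ in its details. The tiny samples and labels arrays are made up so that we can easily confirm the two stay in sync.

```python
import numpy as np

def shuffled_split(samples, labels, test_size=0.25, seed=0):
    """Hypothetical helper: shuffle the indices, then cut them in one place."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(samples))        # a shuffled list of indices
    n_test = int(round(len(samples) * test_size))
    test_idx, train_idx = order[:n_test], order[n_test:]
    # Return in the same order that the text describes for train_test_split().
    return (samples[train_idx], samples[test_idx],
            labels[train_idx], labels[test_idx])

# Made-up data: sample i holds the values [2i, 2i+1], and its label is i.
samples = np.arange(40).reshape(20, 2)
labels = np.arange(20)

X_train, X_test, y_train, y_test = shuffled_split(samples, labels)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

# Each sample should still sit next to its own label after the shuffle.
in_sync = np.array_equal(X_train[:, 0] // 2, y_train)
print(in_sync)
```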


One way to catch these inconsistencies before they become major 
debugging problems is to go slowly and build up our code one line at 
a time in an interactive Python environment, as we discussed earlier. 
When we do something wrong, we'll get an immediate error that we 
can investigate more closely by printing things out, and comparing 
what we’re doing with what the library documentation describes we 
should be doing. 


When learning a new library, lots of little experiments can help us
write good code from the start.


23.5.5 Fixing the Data Type


As we saw in Listing 23.14, the sample data we get from
mnist.load_data() is returned to us as integers. Though this is
efficient and reasonable for storing the data, Keras wants to work with
floating-point numbers. To prevent making incorrect assumptions,
Keras won’t automatically cast, or convert, number types for us. That’s
our job, and it’s mandatory. Keras expects floats and it had better get
them, or it will either go haywire at some point or, more usually, report
an error and stop.


In fact, Keras expects the specific type of floats that match its inter- 
nal floatx parameter. We saw above in Listing 23.1 that we can 
assign that parameter to different data types in the keras configura- 
tion file by assigning a new value to floatx, or in our code by calling 
set_floatx() in the Keras backend, as in Listing 23.2. 


By default, floatx has the value float32, meaning a 32-bit float- 
ing-point value. Unless we change the configuration file, or call a 
backend function to change this in our code, this is the type that Keras 
expects. 


Switching this to another data type (such as float64) is easy to do, but 
knowing when such a choice makes sense is complex and dependent 
on one’s specific hardware and software, so we'll stick with float32. 


Now that we know the format Keras expects for our floating-point 
numbers, we can return to our job of converting our samples into that 
form. The easy way to do this is to use the function cast_to_floatx() 
from the Keras backend, which takes a tensor as an argument and
casts every element of that tensor into the type specified by the current
value of floatx. The routine doesn’t even care about the shape of
the tensor. From a 1D list to some giant tensor with a thousand dimen- 
sions, the routine will simply crank through every entry and convert 
it to our desired type of data. Note that the last word in this routine’s 
name is not float, but rather floatx, referring to the configuration 
variable. Listing 23.17 shows how to use it.


from keras import backend as keras_backend
X_train = keras_backend.cast_to_floatx(X_train) 
X_test = keras_backend.cast_to_floatx(X_test) 


Listing 23.17: Using the Keras backend to change our array types to the 
value it expects. 
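
For intuition, the cast behaves much like NumPy's own astype() conversion. The sketch below uses only NumPy and assumes floatx has its default value of float32; the small arrays are made up for illustration.

```python
import numpy as np

# made-up stand-ins for image data: 8-bit integers, as MNIST pixels are stored
small_list = np.array([0, 128, 255], dtype=np.uint8)
small_block = np.zeros((2, 28, 28), dtype=np.uint8)

# convert every element to 32-bit floats; the shape plays no role
small_list_f = small_list.astype('float32')
small_block_f = small_block.astype('float32')

print(small_list_f.dtype, small_list_f.shape)
print(small_block_f.dtype, small_block_f.shape)
```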


We might be tempted to cast the y_train and y_test arrays to the 
floatx type also, but that’s not necessary. We'll be converting these 
arrays into their one-hot forms below using another utility routine, 
and that routine expects a list of integers as input. This is yet another
of the details that make for slow going when first getting used to
a new library.


Now that our features have the right type, we can move on to making 
sure they have the most useful range of values. 


23.5.6 Normalizing the Data 


Another important step in preparing data is normalizing it. This 
can mean slightly different things in different contexts, but it always 
means changing the data itself, rather than simply re-shaping it. 


The networks that we'll be building in this chapter to categorize the 
MNIST data will use convolution layers near the start, and those will 
work best with data that has been normalized so that each feature has 
been scaled to fit the range 0 to 1.


Note that normalization is just for the features, and not the labels. The 
labels need to refer to the 10 different classes from 0 to 9, and we don’t 
want to change those values. 


Listing 23.14 showed us that our feature data in X_train and X_test
is originally made of integers in the range 0 to 255. This is a common
range for a channel of image data. We’ve just converted these values to
32-bit floats, so we could say that they’re now in the range 0.0 to 255.0.


We said above that we need to normalize our data to the range
[0.0, 1.0]. As we saw in previous chapters, this helps to keep neuron
outputs in the same range, which helps with regularization and delaying
the onset of overfitting. And if we’re using an activation function
like a sigmoid, it keeps our functions from saturating.


We could accomplish this normalization with a full pre-processing 
step. We’d examine the values of the pixels in the training data, build a 
transformation to scale them to [0,1], and then apply that transforma- 
tion to the training data, the test data, and any future data. We could 
create one of scikit-learn’s transformation objects, train it, and then 
apply it to our data. 
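
The idea of building the transformation from the training data and then re-using it can be sketched with plain NumPy. The two helper functions below are hypothetical and much simpler than scikit-learn's transformation objects; they only illustrate that the scale is learned once, from the training pixels, and then applied unchanged to everything else. The pixel values are made up.

```python
import numpy as np

def fit_range(data):
    """Remember the smallest and largest values seen in the training data."""
    return float(data.min()), float(data.max())

def apply_range(data, low, high):
    """Scale data so that low maps to 0 and high maps to 1."""
    return (data - low) / (high - low)

train_pixels = np.array([[0, 64], [128, 255]], dtype='float32')  # made-up values
test_pixels = np.array([[32, 255]], dtype='float32')             # made-up values

low, high = fit_range(train_pixels)                   # learned from training data only
train_scaled = apply_range(train_pixels, low, high)
test_scaled = apply_range(test_pixels, low, high)     # same transformation re-used
print(train_scaled.min(), train_scaled.max())
```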


That’s a perfectly good way to proceed, but when we're working with 
image data like that in the MNIST data set, we almost always trans- 
form our data with a simpler and more direct approach. 


We know that our pixels in the training and test data are in the range
[0, 255]. All we want is to rescale all the pixels in the same way, compressing
them from the range [0, 255] to the range [0,1]. Conceptually,
this is like converting measurements in millimeters into kilometers, or 
vice-versa. 


We can scale our input data with NumPy’s interp() routine, which
is designed for exactly this job. It takes an array (or tensor), an input
range, and an output range. For each entry it will find its location in
the first range (0 to 255) and find its corresponding position in the
second range (0 to 1). Listing 23.18 shows the code.


X_train = np.interp(X_train, [0, 255], [0,1]) 
X_test = np.interp(X_test, [0, 255], [0,1]) 


Listing 23.18: Scaling pixels from [0, 255] to [0,1]. 


This works perfectly, but since we know our data is in the range 0 to
255, we can accomplish the same thing just by dividing all the pixels
by 255.0, as in Listing 23.19.


X_train /= 255.0
X_test /= 255.0 


Listing 23.19: Rescaling our pixels to [0,1] by dividing them by 255. 


Listing 23.18 and Listing 23.19 do exactly the same job. Although the
second approach is a little less explicit about what’s going on, it’s both 
shorter to write and ever-so-slightly faster to execute than the version 
that uses interpolation. 
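
A quick experiment, using a few made-up pixel values, suggests that the two approaches agree on the scaled values. One difference worth knowing about is that np.interp() hands back 64-bit floats, while in-place division keeps the array's existing 32-bit type.

```python
import numpy as np

pixels = np.array([0, 51, 102, 255], dtype='float32')   # made-up pixel values

by_interp = np.interp(pixels, [0, 255], [0, 1])   # the approach of Listing 23.18
by_division = pixels.copy()
by_division /= 255.0                              # the approach of Listing 23.19

print(np.allclose(by_interp, by_division))        # do the values agree?
print(by_interp.dtype, by_division.dtype)         # compare the resulting types
```

If the 32-bit type matters downstream, the division form avoids an extra conversion step.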


These reasons are probably why Listing 23.19 is the common idiom for 
scaling images. Keeping with that convention, we'll use it here as well. 


Let’s gather everything we’ve seen so far in one place. We'll import 
the modules we need, read in the data with Listing 23.10, save the 
sizes with Listing 23.12, convert it to floating-point with Listing 23.17, 
and scale it to [0,1] with Listing 23.19. This is all bundled together in 
Listing 23.20. 


from keras.datasets import mnist
from keras import backend as keras_backend

# load MNIST data and save sizes
(X_train, y_train), (X_test, y_test) = mnist.load_data()
image_height = X_train.shape[1]
image_width = X_train.shape[2]
number_of_pixels = image_height * image_width

# convert to floating-point
X_train = keras_backend.cast_to_floatx(X_train)
X_test = keras_backend.cast_to_floatx(X_test)

# scale data to range [0, 1]
X_train /= 255.0
X_test /= 255.0


Listing 23.20: Reading in our data, saving the sizes, converting to floats, 
and scaling to [0,1].


Our training and test samples are now in floating-point format and
scaled from 0.0 to 1.0. 


This is the end of pre-processing for the samples. We need to remember 
in the future that if we get any new samples that we want to evaluate 
with this network, they too need their pixel data to be converted to 
32-bit floats and divided by 255. 


There’s a subtle point that’s important to note. Any new images we 
get after training is complete should not be simply scaled to the range 
[0,1]. Instead, we need to apply the identical pre-processing that we 
applied above, meaning that the new image data needs to be divided 
by 255. If for some reason there are values in that image less than 0
or greater than 255, then they will turn into floating-point values less
than 0 or greater than 1. That might be inconvenient in some way, but
we can’t avoid it, because we must use the same transformation on the 
new data that we used on the data we train with. 
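
As a small illustration of this point, suppose a new image arrives with a couple of out-of-range values. The numbers below are invented; the sketch only shows that dividing by 255 (the same transformation used for training) can leave values outside [0,1], while rescaling the new image on its own would quietly apply a different transformation.

```python
import numpy as np

new_image = np.array([-10.0, 0.0, 255.0, 300.0], dtype='float32')  # made-up pixels

# correct: the identical transformation we used on the training data
same_as_training = new_image / 255.0

# tempting but wrong: squeeze this one image into [0, 1] on its own
rescaled_alone = np.interp(new_image, [new_image.min(), new_image.max()], [0, 1])

print(same_as_training)   # includes values below 0 and above 1
print(rescaled_alone)     # fits [0, 1], but a pixel of 255 no longer maps to 1
```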


Now let’s pre-process the labels so that they’re ready for use. 


23.5.7 Fixing the Labels 


We know that the MNIST data contains images of digits from 0 to 9. 
So in our network we'll create an output layer with 10 neurons, one for 
each digit. Each neuron will produce a probability that the image it’s 
just been fed corresponds to that digit. The neuron with the highest 
value will be the network’s final prediction for the input. 


We'd like to compute an error value that tells us how close these 10
values are to the values we want. To make this comparison easy, we
represent the label for each image using one-hot encoding, as we
discussed in Chapter 12. For an image of the digit 3, for example, the
label is a list of 10 elements, where all are 0 except for a 1 in slot 3.
Figure 23.20 shows the idea visually.


Figure 23.20: Computing the error. We feed an image (here a picture of
a 3) to our network, and we get back a probability from 0 to 1 for each
possible label from 0 to 9. We compare these 10 numbers with the 10
values in the one-hot label. The more the prediction is like the label, the
smaller the error.
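
The comparison in Figure 23.20 can be tried out with a few lines of NumPy. The probabilities below are invented, and the sum of squared differences is just one simple way to measure the mismatch; it is used here for illustration and is not necessarily the loss function we'll hand to Keras later.

```python
import numpy as np

# one-hot label for an image of the digit 3
label = np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0], dtype='float32')

# made-up network outputs: one probability per digit, summing to 1
prediction = np.array([0.02, 0.03, 0.05, 0.70, 0.01, 0.04, 0.02, 0.03, 0.08, 0.02],
                      dtype='float32')

def squared_error(predicted, target):
    """Sum of squared differences: 0 only when the two lists match exactly."""
    return float(np.sum((predicted - target) ** 2))

print('predicted digit:', int(np.argmax(prediction)))
print('error for this prediction:', squared_error(prediction, label))
print('error for a perfect prediction:', squared_error(label, label))
```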


In this imaginary example, the network has given the value 3 the 
greatest probability, but it’s given each of the other digits some chance 
of being right, too. A perfect answer from the network would be a 
probability of 1 that the input is a 3, so all other choices would have 
a probability of 0. In other words, a perfect prediction would be the 
same as the label. The more the two are different, the higher the error. 
The one-hot form of the label simplifies this comparison of the output 
and the label.


It might seem like one-hot encoding is superfluous, since a network
could do this operation on the fly when it’s needed. That’s true, but 
that step would have to be repeated for every sample during training. 
If we trained for only one epoch (that is, every sample is used once) 
then it wouldn’t matter if we used a pre-processed label or created it 
only when we needed it. But if we train for, say, 200 epochs, then we’d 
have to repeat the on-the-fly encoding of every sample 200 times. It’s 
faster to encode the values just once before we start training. Providing 
pre-encoded labels also lets us create labels with values other than just 
0 and 1, if we prefer.


So we'd like to turn the integers we get back in the variables y_train 
and y_test into one-hot encoded versions. 


Turning each integer in a list into a one-hot encoding is such a common
task that Keras provides a utility for it. The routine to_categorical()
looks through an array of integers and finds the largest value, so it
knows how many entries are needed to represent all the values that need
to be encoded. It then makes a one-hot encoding for each integer in
the list. The output of to_categorical() is a list of these encodings,
which are themselves lists of 0’s and 1’s.


Let’s see one-hot encoding in action. Listing 23.21 shows the first 5
entries of the original y_train array before and after they've been one-hot
encoded.


from keras.utils import to_categorical

# print the first 5 entries of the original y_train array
y_train[:5]

array([5, 0, 4, 1, 9], dtype=uint8)

# encode the y_train array as one-hot lists
y_train = to_categorical(y_train)

# print the new first 5 entries of y_train, now one-hot encoded
y_train[:5]

(array([[ 0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.],
        [ 1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.],
        [ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.]]),
 dtype('float64'))


Listing 23.21: Before and after one-hot encoding the y_train array with
the to_categorical() utility function.


As we can see, the output is a 2D grid with one row for each input.
Every entry is 0 except for a single 1, located at the index corresponding
to the original y_train value for that row.


The one-hot values produced by to_categorical() are in 64-bit
floating-point form, at least in the version of Keras used here. Happily,
floating-point is just fine, since Keras will be comparing these
floating-point values with the floating-point values coming out of our
neural network. It’s a bit strange that Keras doesn’t use the default
floatx type when it produces this data, but the 64-bit floats work fine
when training our network.


We might be tempted to simply pass y_train and then y_test to
to_categorical() in succession and move on, but that could introduce
a subtle bug. The problem is that the largest value in one list
might be different than the largest value in the other, giving us lists of
different sizes.


For instance, suppose that the test data was missing any images of
the digit 9. That means that y_test will contain only the digits 0 to 8.
When we use to_categorical(), each encoding we get back will have only
9 items. This will cause trouble later when we want to compare it to the
values in our output layer, which has a score for each of 10 categories.


We don’t have to worry about this problem with the MNIST data,
because it has examples of every digit in both sets, but it might come
up in other data sets.


There’s an easy, general solution that will always avoid this problem.
It involves using an optional argument to to_categorical() that overrides
its scanning step. This argument, called num_classes, tells the
routine to always make lists of the given length. The prefix num_ is a
common convention which is read as “number of,” so num_classes 
stands for “number of classes.” 


The value of num_classes has to be at least big enough to encode all 
the possible values, or we'll get an error. If num_classes is bigger than 
necessary, that’s fine, and the extra values at the end will always be 0.
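
To see both the problem and the fix without needing Keras, here is a small NumPy stand-in. The function one_hot() is hypothetical; it imitates the behavior described above (scan for the largest value unless a size is given), and is not the Keras implementation. The label lists are made up.

```python
import numpy as np

def one_hot(labels, num_classes=None):
    """Hypothetical stand-in for to_categorical(), written with NumPy."""
    labels = np.asarray(labels, dtype='int64')
    if num_classes is None:
        num_classes = int(labels.max()) + 1      # the "scanning" step
    encoded = np.zeros((len(labels), num_classes))
    encoded[np.arange(len(labels)), labels] = 1  # a single 1 per row
    return encoded

train_labels = [5, 0, 4, 1, 9]   # made-up labels that include a 9
test_labels = [3, 8, 0, 2]       # made-up labels with no 9

# letting each call scan its own list gives rows of different lengths
print(one_hot(train_labels).shape, one_hot(test_labels).shape)

# fixing the size ahead of time keeps the two encodings compatible
print(one_hot(train_labels, 10).shape, one_hot(test_labels, 10).shape)
```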


To make sure both encodings will be the same size for any two lists of 
labels, we will combine all the labels into one big list and extract its 
largest value. Since we’re starting with 0, we'll add 1 to the result, and 
that’s the smallest size of the list that can encode all the values in all 
the labels. 


Listing 23.22 shows how to use to_categorical() to turn our list of
integer labels into a list of one-hot encodings in a general way.


# combine the input lists to find largest value
# in either list, then add 1 because the values start at 0
number_of_classes = 1 + max(np.append(y_train, y_test))

# encode each list into one-hot arrays of the size we just found
y_train = to_categorical(y_train, num_classes=number_of_classes)
y_test = to_categorical(y_test, num_classes=number_of_classes)


Listing 23.22: The label arrays are replaced with one-hot encodings.


Sometimes we want the original list of integers somewhere else in the
program, as we'll see later when we do cross-validation. We can “undo”
the one-hot encoding in two ways. If the one-hot encoding is represented
as a regular Python list (that is, not a NumPy array), we can use
Python’s built-in index() method, as in Listing 23.23.


one_hot = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
print('one-hot represents the integer', one_hot.index(1))


one-hot represents the integer 3 


Listing 23.23: Using Python’s index() method to “undo” one-hot 
encoding. 


If the one-hot version is a NumPy array, then we can’t use index(), 
because NumPy doesn’t support that method. There are several ways 
to use NumPy to find the index of a single 1 in a list of 0’s. Listing 23.24
shows one way to do it. This uses NumPy’s argmax() method, which 
returns the index of the largest value in a list. 


one_hot_np = np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0])
print('one_hot_np represents the integer', np.argmax(one_hot_np))


one_hot_np represents the integer 3 


Listing 23.24: Using NumPy’s argmax() to “undo” one-hot
encoding.


Rather than use either of these methods to find the integer versions of 
the one-hot encodings, we'll just save the original integer lists before 
we call to_categorical(), as in Listing 23.25. 


# save the original y_train and y_test 
original_y_train = y_train 
original_y_test = y_test 


Listing 23.25: Saving the labels in their original format as lists of integers.


Just for reference, Listing 23.26 provides a Python one-liner that will
undo one-hot encoding, for those times when we're given the data 
already in one-hot form. 


original_y_train = [np.argmax(v) for v in y_train] 
original_y_test = [np.argmax(v) for v in y_test] 


Listing 23.26: Turning our one-hot encoded targets back into lists of 
integers. 
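
When the one-hot data is a 2D NumPy array, argmax() can also process every row in a single call by naming the axis to search along. This is offered as an alternative sketch to the list comprehension; note that it returns a NumPy array rather than a Python list. The small array here is made up.

```python
import numpy as np

# made-up one-hot rows for the labels 5, 0, and 4
one_hot_rows = np.array([[0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
                         [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                         [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]], dtype='float32')

# axis=1 asks for the index of the largest value within each row
integer_labels = np.argmax(one_hot_rows, axis=1)
print(integer_labels)
```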


Because one-hot encoding is so common, scikit-learn also offers a
tool to perform it. It’s in the preprocessing module, and is called
OneHotEncoder.


23.5.8 Pre-Processing All in One Place 


We've just reached the first mountaintop! It’s been a long way, but 
we've done a lot. Starting from an empty slate, our data is now ready 
for training. 


To recap, we began by reading in (and possibly downloading) the 
MNIST data, and then prepared each image for Keras by changing it 
from integers to floats, and then normalized it. Then we created one- 
hot encodings of our labels. 


Listing 23.27 brings all of these pre-processing steps together in one 
place. We’ve also added a line to seed NumPy’s random number gen- 
erator. This means that any random numbers we get from NumPy will 
always be the same from one run to the next. Though we’re not using 
random numbers yet, we will be using them later. Forcing our random 
numbers to always come out the same in each run makes debugging a 
lot easier.


from keras.datasets import mnist
from keras import backend as keras_backend
from keras.utils import to_categorical
import numpy as np

random_seed = 42
np.random.seed(random_seed)

# load MNIST data and save sizes
(X_train, y_train), (X_test, y_test) = mnist.load_data()
image_height = X_train.shape[1]
image_width = X_train.shape[2]
number_of_pixels = image_height * image_width

# convert to floating-point
X_train = keras_backend.cast_to_floatx(X_train)
X_test = keras_backend.cast_to_floatx(X_test)

# scale data to range [0, 1]
X_train /= 255.0
X_test /= 255.0

# save the original y_train and y_test
original_y_train = y_train
original_y_test = y_test

# replace label data with one-hot encoded versions
number_of_classes = 1 + max(np.append(y_train, y_test))
y_train = to_categorical(y_train, num_classes=number_of_classes)
y_test = to_categorical(y_test, num_classes=number_of_classes)


Listing 23.27: Combining the fragments above to create a complete 
pre-processor. 
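
The effect of seeding can be checked with a tiny experiment. This sketch uses the same np.random.seed() call as the listing; the choice of np.random.rand() to draw the numbers is just for illustration.

```python
import numpy as np

random_seed = 42

np.random.seed(random_seed)
first_run = np.random.rand(3)    # three pseudo-random values

np.random.seed(random_seed)      # re-seeding restarts the same sequence
second_run = np.random.rand(3)

print(np.array_equal(first_run, second_run))
```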


Part of the appeal of using libraries like scikit-learn and Keras is that 
there is remarkably little fiddling about with additional Python code 
to get things done. Almost every line of Listing 23.27 is either doing 
a specific pre-processing step, or saving variables that we'll use again 
later.


In this code we’re repeatedly over-writing the data in X_train and
X_test, and the labels in y_train and y_test. This is a common 
approach during pre-processing, because we don’t care about the start- 
ing or intermediate values. The upside is a degree of simplicity. The 
downside is that if we want to access the original data, we either have 
to save it (as we do here for the labels), or load a fresh copy of the data. 


23.6 Making the Model 


This section’s notebook is 
Keras-Notebook-02-Making-the-Model.ipynb 


Now that our data is ready for use, let’s build our deep learning model. 


The beauty of model-making in Keras is that creating the structure of 
our model (that is, our neural network’s architecture) is streamlined. 
There are only two steps. 


First, we name the layers we want in the order we want them. This is 
called specifying the model. 


Second, we tell Keras how to use this model to learn. We tell it which 
loss function and optimizer to use, and what data we’d like it to collect 
along the way. This is called compiling the model. The compilation 
step converts our specification into code that runs on the backend 
we’ve chosen.


Our first model for classifying MNIST data will be simple. It will have 
an input layer (which is implicit in every network), a single hidden 
layer, and an output layer. The hidden and output layers will both 
be fully-connected, or dense, layers. Figure 23.21 shows our first 
deep-learning system.


Figure 23.21: Our first, very simple deep-learning model consists of a
fully-connected layer of 784 neurons (one for each input pixel) followed 
by a fully-connected layer of 10 neurons (one for each output class). 


Recall that in drawings like Figure 23.21, we don’t draw the input 
layer, because it’s just a memory buffer. By convention, data flows left 
to right, as we’re doing here, or sometimes instead bottom to top. The 
labels at the ends show the size and shape of the data going into and 
coming out of the network. 


We've decided to set up our first layer to have a single neuron for each 
pixel. This is a common way to configure the first layer, but it’s defi- 
nitely not required. We could use 5 neurons or 5000 if we thought that 
would produce better results. 


Using this “one neuron per input pixel” approach for our 28 by 28 
images, our first layer requires 28x28=784 neurons. 
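
As a rough sense of scale, we can count the adjustable values in the model of Figure 23.21. This back-of-the-envelope sketch assumes the usual arrangement for a fully-connected layer: one weight per input for each neuron, plus one bias per neuron. The helper dense_parameter_count() is a hypothetical name used only here.

```python
image_height = 28
image_width = 28
number_of_pixels = image_height * image_width      # 784 inputs
number_of_classes = 10

def dense_parameter_count(number_of_inputs, number_of_neurons):
    """One weight per input for every neuron, plus one bias per neuron."""
    return number_of_inputs * number_of_neurons + number_of_neurons

hidden_count = dense_parameter_count(number_of_pixels, number_of_pixels)
output_count = dense_parameter_count(number_of_pixels, number_of_classes)
print(number_of_pixels, hidden_count, output_count, hidden_count + output_count)
```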


Wait a second. We saw above that our input is a list of 2D grids, each 
28 by 28. Why are we setting up our network to expect a flat list, rather 
than a 2D grid? 


We're not doing that by choice. A fully-connected layer can only take
in a 1D list. There’s no processing inside of a dense layer that would let
it figure out how to get at the pixels in a 2D data structure. We'll see
later that convolution layers have that processing, so we can give them 
grids directly. But right now we’re using a dense layer, and the input to 
a dense layer is a list. 


So we need to convert each input sample of 28 by 28 pixels into a 1D 
list of 784 values.


23.6.1 Turning Grids into Lists


There are at least two ways to do this. The first is to build it right into 
our neural network, using the Reshape utility layer provided by Keras. 
The second is to reshape the data ourselves before training. 


The first approach has simplicity going for it. We just make a Reshape 
layer and stick it ahead of the Dense layer and we’re done. The down- 
side is that every sample will get reshaped every time it’s evaluated, and 
that will take some time. Since we expect to be running all the train- 
ing samples through the network multiple times (that is, we’ll train for 
multiple epochs), it’s more efficient to pre-process it ourselves once. 
Recall that this is the same logic that led us to pre-process our labels 
into one-hot versions. 


To convert our images into a list, we'll convert our starting 3D input 
data into a 2D grid. Each row of the grid is one sample, made up of a 
list of 784 features. The result is shown in Figure 23.22. 


Figure 23.22: Turning our 3D input into a 2D grid, containing one long
row of pixels for each image (sample 1, sample 2, and so on).


This is easy to do using NumPy’s reshape() function, discussed above.
We'll tell it to re-interpret X_train, which it is thinking of as a 3D
block with dimensions 60,000 by 28 by 28, instead as a 2D array with
dimensions 60,000 by 784.


As we discussed above, there are two ways to use reshape(). Let’s first
use the version where we call it from NumPy and pass it the array we’re
reshaping as the first argument.


The second argument to reshape() is a list with the new dimensions.
In this case, the second argument is the list [60000, 784]. To make it
easier to re-use this code for other projects later, we'll get these numbers
from the data rather than typing them in directly. Recall that
number_of_pixels has been set in our pre-processing step of Listing 
23.27 to be the size of each input image, or 784. 


For simplicity, we'll continue to over-write our values of X_train and 
X_test with these new versions. Listing 23.28 shows the code. 


# reshape samples to 2D grid, one line per image
X_train = np.reshape(X_train,
                     [X_train.shape[0], number_of_pixels])
X_test = np.reshape(X_test,
                    [X_test.shape[0], number_of_pixels])


Listing 23.28: Flattening our images into a 2D grid, so each sample is just 
a single list of numbers. This is the format we need for a dense layer, like 
our first layer in Figure 23.21. 


As we discussed, the other way to call reshape() is to call it as a 
method on the object being reshaped. In this case, the only necessary 
argument is the list containing the new dimensions. Because this is 
also common, we demonstrate this in Listing 23.29.


# reshape samples to 2D grid, one line per image
X_train = X_train.reshape([X_train.shape[0], number_of_pixels]) 
X_test = X_test.reshape([X_test.shape[0], number_of_pixels]) 


Listing 23.29: Another way to reshape our images into a 2D grid. The 
results are identical to those of Listing 23.28. 


Both of these variations produce the same results, so we can use 
whichever one we prefer. We'll use the shorter, second version in the 
following discussion. 
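
Here is the same flattening step on a small made-up array, so the shapes are easy to check. It also shows -1, a NumPy convenience that asks reshape() to work out that dimension for us.

```python
import numpy as np

# made-up stand-in for X_train: 5 "images", each 28 by 28
X_small = np.arange(5 * 28 * 28, dtype='float32').reshape(5, 28, 28)
number_of_pixels = X_small.shape[1] * X_small.shape[2]

flat_a = np.reshape(X_small, [X_small.shape[0], number_of_pixels])  # function form
flat_b = X_small.reshape([X_small.shape[0], number_of_pixels])      # method form
flat_c = X_small.reshape(-1, number_of_pixels)                      # NumPy infers the 5

print(flat_a.shape, np.array_equal(flat_a, flat_b), np.array_equal(flat_b, flat_c))

# each row holds one image's pixels, read row by row from the original grid
print(np.array_equal(flat_a[2], X_small[2].ravel()))
```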


This re-shaping step is properly part of the pre-processing section, 
because we only need to do it once, so we'll place it there in the listings 
below. 


We'll see later that other types of layers, such as convolution layers, 
will want their data to be shaped in other ways. Getting the data into 
the right structure is an essential step in training neural networks. 


23.6.2 Creating the Model 


Now that our data is fully processed, we can build the model. 


We start by telling Keras the overall architecture of our model. Our 
choices are basically “a list of layers,” and “anything else.” 


The “list of layers” architecture is called the Sequential model. That’s 
perfect for us, since our architecture of Figure 23.21 is just two dense 
layers one after the other. In other words, they can be described as a 
2-element list starting with the hidden layer and ending with the out- 
put layer. 


The “anything else” architecture is called the Functional model. This 
is more flexible than the Sequential model, but requires a little more 
work from us. We'll come back to the Functional model later. 


We build a model in the Sequential style using the Sequential API, 
which is a collection of library calls designed to make this process easy. 
The beauty of the Sequential API is that to create our model we just
name our layers in order from start to finish. This lets Keras automatically
work out how each layer connects to the one before and after, so
it can manage the flow of data from one layer to the next automatically. 
This is a great time-saver both in programming and debugging. 


To build our model, we create a variable to hold a Sequential object. 
This is initially an empty list of layers. Then we add our layers to that
object. 


The first time we add a layer to our model, Keras will automatically cre- 
ate an input layer for us to hold the incoming data. Then it places our 
new layer after that. We could stop right there if we wanted, and that 
would be a 1-layer neural network (remember that we usually don’t 
count the input layer, since it doesn’t do any processing). 


But we can keep going, and add as many more layers as we like. Each 
new layer takes its input from the most recently added layer. The last 
layer we add in is implicitly our output layer. We never explicitly say 
that we’re starting or ending. We just add in layers until we’re done. 


Listing 23.30 shows the first step, where we create the Sequential 
object and save it in a variable. 


from keras.models import Sequential 
model = Sequential() 


Listing 23.30: Creating an empty deep-learning architecture in the 
Sequential style. 


A quirk of this approach is that the layers appear in the code in exactly
the opposite order that we normally draw them. As we’ve seen, the
drawing convention is to show the layers going rightwards or upwards.
But in the source code, each new layer appears under the one that precedes
it, so reading the code downwards corresponds to reading the
figure rightwards or upwards. This can take a little getting used to, but
eventually the mental flip becomes second nature.


Let’s start building our model. The first layer is always the input layer.
But recall that the input layer is implicit. We don’t usually draw it, or 
count it, and in the Sequential model we usually don’t even explicitly 
make it. 


This is fine, because the input layer does nothing but hold the feature 
list for a sample. So the only thing we need to tell Keras about the input 
layer is how big that list should be, and it will make the appropriate 
storage for us. 


We tell Keras the size of the input layer with an optional argument 
called input_shape. We pass a value to this argument in the first layer 
only. In other words, this argument must be included when we make 
our first layer, but must not be in any others. Every type of layer that 
can serve as the first layer in a sequence (including the fully-connected 
layer we'll be using), takes input_shape as an optional parameter. 


Let’s make our first layer. 


Our diagram of Figure 23.21 specifies that our first layer is a fully-con- 
nected layer. 


Keras calls a fully-connected layer a dense layer. Note that here the 
word “dense” refers to how the layer connects to the layer that pre- 
cedes it. In other words, every neuron in this layer will be connected 
to every output from the previous layer. We are saying nothing at all 
about what happens to the outputs of the neurons on this layer. Keras 
will only discover where they go and how they get used when we spec- 
ify the next layer in the description. If there is no next layer, then the 
outputs of this layer are the outputs of the whole system. 


Since most layers are in the midst of a stack, we usually refer to the 
neurons receiving data from neurons in the “previous” layer. In the 
special case when the previous layer is the input layer, those neurons 
get their data from the values of the input saved on that layer. 


Figure 23.23 shows a dense layer in schematic form.


Figure 23.23: A schematic view of a Dense layer. The three colored
neurons make up the dense layer. Each of them connects to every neuron 
in the preceding layer (in gray). When we create this layer, we’re only 
declaring the nature of its connections to the layer before it, and we’re 
saying nothing about what happens to its outputs. 


To add a dense layer to our model, we create a Dense object and then 
append it to the end of our model’s sequence of layers. Although the 
Dense object has many arguments, we'll only use three of them right 
now. In standard Python convention, the first argument (which is 
mandatory) is not named, but the others are named and may appear 
in any order. 


The necessary first argument is the size of the layer. This is just the 
number of neurons. This can be, and often is, different from the num- 
ber of nodes in the preceding layer. For instance, the previous layer 
(whether it’s the input layer, or has neurons) might have 4 outputs. 
Our Dense layer could have fewer than that, or the same amount, or 
more, as shown in Figure 23.24. 
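
One common way to picture what a fully-connected layer computes is a matrix multiplication followed by an activation function. The NumPy sketch below is only an illustration under that assumption (random weights, zero biases, ReLU), not Keras's internal code, and dense_forward() is a hypothetical helper. It shows that the number of neurons is set by the shape of the weight grid and is independent of the number of inputs.

```python
import numpy as np

rng = np.random.default_rng(42)

def dense_forward(inputs, weights, biases):
    """Every neuron sees every input: weighted sum, add bias, then ReLU."""
    return np.maximum(0.0, inputs @ weights + biases)

previous_outputs = np.array([0.5, -1.0, 2.0, 0.0])   # 4 made-up values

for number_of_neurons in (2, 4, 6):                  # fewer, the same, or more
    weights = rng.normal(size=(4, number_of_neurons))  # one column per neuron
    biases = np.zeros(number_of_neurons)
    outputs = dense_forward(previous_outputs, weights, biases)
    print(number_of_neurons, 'neurons ->', outputs.shape)
```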







Figure 23.24: Our fully-connected layer is shown with colored neurons, 
connecting to a previous layer with gray neurons. The number of neurons 
in the fully-connected layer is independent of the number of neurons in 
the layer that precedes it.


As we discussed above, for our first classifier we'll use the same number
of neurons as there are pixels in the inputs. This is a common way
to set up an image classifier, but we might later find that the system 
learns better with fewer nodes in this layer, or more. In the back of our 
minds we can consider this a variable to play with later on, to see what 
value gives us the best performance. 


The first optional argument we'll use tells Keras which activation unit
to place after each neuron in the layer. We can specify any one of the
functions built into Keras (and, as usual, listed in the documentation)
by supplying a string. Common choices are 'relu' and 'tanh'
for the ReLU and tanh functions in hidden layers, and 'softmax' or
'sigmoid' for the output layer. The default is None, which means no
activation at all (that is, the linear activation function), so for
internal layers we'll almost always want to specify one of the other choices.
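

As a reminder of what those strings stand for, here is a small plain-Python sketch of the four functions named above. These are our own illustrative versions, not the Keras implementations, and the three sample values are arbitrary.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(values):
    # subtract the largest value first for numerical stability
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

neuron_outputs = [-2.0, 0.0, 3.0]

print([relu(v) for v in neuron_outputs])       # negative values become 0
print([math.tanh(v) for v in neuron_outputs])  # squashed into the range (-1, 1)
print([sigmoid(v) for v in neuron_outputs])    # squashed into the range (0, 1)

probabilities = softmax(neuron_outputs)
print(probabilities, sum(probabilities))       # non-negative, and they sum to 1
```

Note that ReLU, tanh, and sigmoid each work on one value at a time, while softmax needs all of the layer's outputs at once.
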


The second optional argument we'll use is input_shape, which defines 
the size of each dimension in the input. As we saw above, we use this 
only for the very first layer in a model. The value of this argument is a 
list that tells Keras to build an input layer of the given shape and size, 
which must match the shape and size of each sample we'll be providing. 


Since each of our samples (after processing) is a 1D list of 784 numbers, 
we'll tell Keras that our input_shape is a 1D list of 784 numbers (using 
the variable number_of_pixels that we saved during pre-processing). 


Listing 23.31 shows how to create our first Dense layer. 


from keras.layers import Dense

# create the Dense layer
dense_layer = Dense(number_of_pixels, activation='relu',
                    input_shape=[number_of_pixels])


Listing 23.31: Creating our first Dense layer. We need to import the
Dense object from keras.layers to access it. Because this is the first
layer in the model, we provide a value for input_shape.


Once we’ve made our Dense layer, how do we add it to our model?
Curiously, although Python has a built-in operation called append()
that adds one element to the end of a list, Keras doesn’t use that name 
for this operation, which is conceptually the same. Instead, it uses the 
ambiguous name add(), in its colloquial sense of “add another log 
to the fire,” rather than its numerical sense of “add 2 and 4.” It may 
be useful to think of the Keras add() routine as though it had a more 
descriptive name such as “append.” 


Listing 23.32 shows the code for appending our layer to the list of
layers in model.


# append our layer to the list of layers in model 
model.add(dense_layer) 


Listing 23.32: Appending a new layer to our model. 


Using the two listings above one after the other is perfectly fine. It’s 
clear and it works right. Listing 23.33 shows the sequence. 


dense_layer = Dense(number_of_pixels, activation='relu', 
input_shape=[number_of_pixels] ) 
model.add(dense_layer) 


Listing 23.33: Creating a Dense layer, and adding it to our model, in two 
steps. 


But it’s conventional to create the layer and add it to the model in a 
single line, as in Listing 23.34. This means that the layer doesn’t get 
a variable that holds it, but we rarely need that (Keras does provide a 
mechanism for getting at the layer later, if we really need it). 


model.add(Dense(number_of_pixels, activation='relu', 
input_shape=[number_of_pixels]) ) 


Listing 23.34: A more efficient and common way to create a Dense layer 
and then add it to our model. 


Now we can add the next layer of our model. This will be another Dense
layer, but with 10 neurons. 


As we mentioned before, we don’t explicitly tell Keras that this is our 
output layer. We just make it and add it to the growing list of layers. 
When we use the model, Keras will treat it as the output layer simply 
because it’s the last one on the list. 


We create our next Dense layer much like the previous one, but with 
a few changes. In particular, we leave out the input_shape argument, 
since that is only for the very first layer. 


As always, the first argument, which is un-named and mandatory, is 
the number of neurons. Since we’re categorizing our images into 10 
classes, we'll have 10 neurons, one for each class. We'll use the vari- 
able number_of_classes that we saved during pre-processing. 


As discussed in Chapter 17, we often use softmax to process the outputs
of a final dense layer in a classifier in order to turn them into
probabilities. Let’s do that here. We need only name it as a string, and Keras
will take care of the rest.


Using the standard style of creating and appending the layer in one 
step, our next line is shown in Listing 23.35. 


model.add(Dense(number_of_classes, activation='softmax' ) ) 


Listing 23.35: Adding our second Dense layer, which will work as the 
output layer. 


Keep in mind that because this layer is fully-connected to the previous 
layer, each of these 10 nodes receives inputs from all 784 nodes in the 
hidden layer. 


That’s the whole thing. We’ve built a deep-learning model! Listing 
23.36 brings it all together. 


model = Sequential()
model.add(Dense(number_of_pixels, activation='relu',
                input_shape=[number_of_pixels]))
model.add(Dense(number_of_classes, activation='softmax'))


Listing 23.36: All the code needed to create our deep-learning model. 


That’s all there is to it! Our model is complete! 


We can ask Keras to print out the model in text form. This isn’t terribly 
revealing for our simple example, but it can come in useful for much 
larger models with tens or hundreds of layers. We call the model’s 
summary() method, as in Listing 23.37. This printout lists the layers 
in the order they were placed into the network, so we read it top-down. 
This summary is rather terse, and doesn’t include information like the
activation functions we’ve chosen for each layer.


model.summary()


Layer (type)              Output Shape        Param #
dense_1 (Dense)           (None, 784)         615440
dense_2 (Dense)           (None, 10)          7850


Total params: 623,290
Trainable params: 623,290
Non-trainable params: 0


Listing 23.37: Our model summary from Keras. 


Keras automatically numbers the layers, such as dense_1 and dense_2 
here. During an interactive session, these numbers will increase over 
time, so if we build our model again and again we'll see something like 
dense_3 and dense_4, and so on. Keras gives every layer it builds a 
unique label so they don’t get mixed up if we build our model over and
over in a given session.


The column labeled “Output Shape” tells us the shape of the tensor
that comes out of each layer, in the form of a list of dimensions. When 
we see None as an entry here, this is a placeholder for the number of 
samples that are provided as a mini-batch during training. For exam- 
ple, if we have a mini-batch size of 64, then the first layer will process 
64 of our samples in one shot (using the GPU if it can). The output 
will be a list containing 64 rows, each with 784 elements. But since 
right now Keras doesn’t know the size of the mini-batch, it uses None 
to stand for “Not Yet Known.” 


The summary also tells us how many parameters, or weights, are used
by each layer, and then it adds those up to tell us the total number
of parameters in the model. We can see that dense_1, the first Dense
layer, has 784 neurons, each of which reads the value of each of the 784
inputs. Since each connection has a weight, there are 784×784 = 614,656
weights. Each neuron also has a bias term, so adding the 784 bias terms
to the number we just got gives us the 615,440 in the table. That’s a lot
of weights! Similarly, the second layer has 10 neurons, each with a
connection to each of the 784 neurons in the previous layer. Remembering
to add the 10 bias terms, we get (10×784)+10, or 7,850 parameters.


The final line adds these numbers together, telling us that the com- 
plete model has over 600,000 parameters. 
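

The arithmetic above is easy to check with a few lines of plain Python. The helper name dense_parameter_count is our own; the rule it encodes (one weight per connection, plus one bias per neuron) is the one we just described.

```python
def dense_parameter_count(num_inputs, num_neurons):
    """One weight per connection, plus one bias per neuron."""
    return num_inputs * num_neurons + num_neurons

number_of_pixels = 784
number_of_classes = 10

# hidden layer: 784 inputs feeding 784 neurons
hidden_params = dense_parameter_count(number_of_pixels, number_of_pixels)

# output layer: 784 hidden neurons feeding 10 neurons
output_params = dense_parameter_count(number_of_pixels, number_of_classes)

total_params = hidden_params + output_params
print(hidden_params, output_params, total_params)
```

The three numbers printed should agree with the Param # column and the total in Listing 23.37.
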


This is food for thought. Our tiny two-layer model involves well 
over a half-million weights that need to be adjusted on every update 
step. Bigger networks can easily have tens or hundreds of millions of 
parameters. For example, the VGG16 network we used in Chapter 22 
to classify images uses almost 140 million parameters [Lorenzo17]. No 
wonder the efficient backprop algorithm is so popular, and accelerating
it on a GPU is so attractive.




Chapter 23: Keras Part 1 


23.6.3 Compiling the Model 


This section’s notebook continues 
Keras-Notebook-02-Making-the-Model.ipynb 


So far, our model is nothing more than a list of specifications. It’s a 
potential model, but it’s no more a real model than blueprints for a 
house are a real house. That house has to be built from the blueprints. 
In our case, we need to turn our description into running code. We call 
this compiling the model. When our model is compiled, it’s ready for 
training. 


The act of compiling turns our layer descriptions into code that will run 
on our computer (and GPU, if available). This is where Keras writes 
programs for us in Theano, TensorFlow, or CNTK. When we train and 
use our model, we'll be using that code. 


To compile the model, we need to give Keras at least two pieces of 
information. 


First, we have to tell Keras how to measure the error for each sample 
(that is, how to put a number to any difference between the network’s 
output and the target we want it to produce). Second, we have to tell 
it which optimizer it should use to update the weights to reduce that 
error. Let’s look at these in turn. 


To measure the quality of the weights we need a loss (or cost) function. 
When we discussed backprop in Chapter 18 we used a simple measure- 
ment of error based on the differences between the output value(s) and 
the label value(s). But there are alternatives. 


Loss functions are interesting to think about, because they give the 
“why” of our network. The neurons, dropout layers, activation func- 
tions, and so on are the “what” of our network, providing the individual 
pieces, like the gears in a mechanical clock. The computation of a result, 
followed by backprop and weight updates, are the “how,” like the way 
the gears of a clock are connected to and propel one another. 


But the error function tells us why we're doing it all. Is it to find one
perfect label? Is it to find three equally-likely labels? Is it to predict a 
floating-point value? The name for a face in a photo? The best stock to 
buy tomorrow? A phrase translated from one language to another? Or 
perhaps it’s something more esoteric. 


Every neural network has a purpose, and the loss function in some 
sense defines that purpose, because it drives the whole enterprise. The 
network’s goal is to make the loss, or error, as small as possible. So the 
loss function is driving the whole show. 


Because of their versatility and importance, loss functions can get com- 
plicated in a hurry. And that usually means a lot of mathematics. 


The good news is that most of the basic things that we will be doing 
with deep learning fall into just a few typical applications, and each 
one has a ready-made loss function already programmed into Keras 
for just that job. We need only name the one that was designed for our 
purpose. Since we’re building a multi-category classifier, rather than, 
say, a network to perform regression or binary classification, we'll tell 
Keras to use the pre-built loss function appropriate for a multi-category
classifier.


That function will compare the one-hot label with the outputs from 
our final layer. This comparison uses the idea of entropy from 
Chapter 6 to determine how close our match is. The name of the 
loss function we want combines these two ideas into the long string 
'categorical_crossentropy'.
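

Here is a minimal plain-Python sketch of that comparison, assuming the usual definition of categorical cross-entropy as the sum of -label × log(prediction) over the classes. The function below is our own illustration, not the Keras routine, and the small guard value tiny is an assumption of the sketch.

```python
import math

def categorical_crossentropy(one_hot_label, predicted_probabilities):
    """Sum of -label * log(prediction) over all classes."""
    tiny = 1e-7   # guard against log(0); this value is an assumption of the sketch
    return -sum(label * math.log(max(p, tiny))
                for label, p in zip(one_hot_label, predicted_probabilities))

one_hot_label = [0, 0, 1]            # the correct class is class 2

confident_right = [0.05, 0.05, 0.90]
unsure          = [0.30, 0.30, 0.40]
confident_wrong = [0.90, 0.05, 0.05]

loss_right = categorical_crossentropy(one_hot_label, confident_right)
loss_unsure = categorical_crossentropy(one_hot_label, unsure)
loss_wrong = categorical_crossentropy(one_hot_label, confident_wrong)
print(loss_right, loss_unsure, loss_wrong)
```

Because the label is one-hot, only the probability assigned to the correct class contributes, and the loss grows as that probability shrinks.
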


If we have just two categories, and we’re using one output to decide 
between them (perhaps setting it to a value near 0 for one category
and a value near 1 for the other), the function that evaluates the error 
for that case is named 'binary_crossentropy'. 
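

The two-category case can be sketched the same way. This is our own plain-Python illustration, assuming the standard binary cross-entropy formula; it is not the Keras code, and the clamping value tiny is an assumption.

```python
import math

def binary_crossentropy(label, predicted):
    """label is 0 or 1; predicted is the single output, between 0 and 1."""
    tiny = 1e-7   # keeps log() away from 0; an assumption of this sketch
    predicted = min(max(predicted, tiny), 1.0 - tiny)
    return -(label * math.log(predicted)
             + (1 - label) * math.log(1.0 - predicted))

loss_good = binary_crossentropy(1, 0.95)    # label 1, output near 1
loss_poor = binary_crossentropy(1, 0.10)    # label 1, output near 0
loss_other = binary_crossentropy(0, 0.05)   # label 0, output near 0
print(loss_good, loss_poor, loss_other)
```

An output near the correct end of the range gives a small loss for either category, and an output near the wrong end gives a large one.
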


There are a bunch of other error functions, all listed in the Keras doc- 
umentation, which are useful when doing regression or a variety of 
other specific tasks. And if the perfect loss function isn’t already there, 
we can write our own in Python and tell Keras to use it instead. 


Happily, our goal here is basic categorization using multiple outputs,
so we can use the pre-built 'categorical_crossentropy' loss. That 
tells the network that we want the network’s outputs to match the 
numbers in our one-hot label as closely as possible. 


With the loss function selected, our next job is to pick the optimizer. 
Once the error has been computed, Keras gives it to the optimizer, 
which uses that error to update the weights. We saw a variety of opti- 
mizers in Chapter 19, with names like SGD, RMSprop, and Adagrad. 
Once again, they’re all implemented for us already, so we only need to 
tell Keras which one we want it to use by providing its name. 


There are many other optional pieces of information we can give to 
Keras when we compile our model. One of the most common is to pro- 
vide a list of measurements, called metrics, telling Keras what we’d 
like it to measure as the model learns. We can think of these metrics as 
supplemental error or loss functions, but they’re only computed and 
returned to us as helpful information for understanding and monitor- 
ing the learning process, and are not used to update the model. There 
are many metrics available to choose from. If we don’t see the quan- 
tity we wish to measure, we can create a function to compute a custom 
metric which will be evaluated for us. Though the metrics are always 
a list, we usually provide a list of just one element, requesting it to 
record the accuracy, using the string 'accuracy'.
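

For a classifier with one-hot labels, accuracy amounts to asking how often the largest output lines up with the 1 in the label. The sketch below is our own plain-Python version with made-up predictions; Keras computes this for us.

```python
def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(one_hot_labels, predictions):
    """Fraction of samples whose largest output matches the label's 1."""
    correct = sum(argmax(label) == argmax(prediction)
                  for label, prediction in zip(one_hot_labels, predictions))
    return correct / len(one_hot_labels)

one_hot_labels = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
predictions    = [[0.8, 0.1, 0.1],   # right
                  [0.2, 0.7, 0.1],   # right
                  [0.5, 0.3, 0.2],   # wrong: picks class 0, label is class 2
                  [0.1, 0.6, 0.3]]   # right

print(accuracy(one_hot_labels, predictions))
```

Unlike the loss, this number ignores how confident each prediction was, which is one reason it is reported for our benefit rather than used to update the weights.
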


We compile our model by calling our model’s compile() method. This 
builds everything that the model needs to actually run on our com- 
puter with our chosen backend. Because this information is saved 
along with the model object, we don’t have to save anything ourselves. 
When compile() returns, the model is ready to learn. 


Listing 23.38 shows how to call compile() with a loss function, an
optimizer, and a list of metrics. In this case we’re using the
'categorical_crossentropy' loss function, which as we discussed above
is the appropriate choice for a classification problem with multiple
outputs. We’ve picked the 'adam' optimizer, just because it’s usually a
good place to start, and we’ve specified the common choice of
'accuracy' for the metrics to be measured once we start learning.


model.compile(loss='categorical_crossentropy', 
optimizer='adam', metrics=['accuracy']) 


Listing 23.38: Compiling a model with its compile() method and our 
arguments. We're choosing the 'categorical_crossentropy' loss 
function and the 'adam' optimizer. Using these strings is a shorthand 
for creating the corresponding objects with their defaults. We're also 
telling it that we'll want it to measure and return the 'accuracy' once 
we start training. 


Our initial choices of the loss function and optimizer are, as usual, 
guided by experience. We pick something we hope is reasonable, see 
how it goes, and then make changes to improve on the performance 
we get. 


If we think we're close but things could be better, we might decide to 
create a custom optimizer and set some of the parameters to some- 
thing other than the defaults. 


For example, the Keras documentation says that Adam’s learning rate 
argument is called lr (a lower-case L and R), and its default value 
is 0.001. Maybe we have a hunch that a smaller starting value could 
improve our results. When we create our optimizer using the string 
'adam', as in Listing 23.38, we're asking for an instance of the Adam 
optimizer with all of its default values. To set some of those values our- 
selves, we make our own instance of an Adam object where we specify 
whatever parameters we want to give values to, leaving all the others 
at their defaults. We then hand that object to compile(), instead of 
giving it a string. Listing 23.39 shows how. 




from keras import optimizers

slow_adam = optimizers.Adam(lr=0.0001)
model.compile(loss='categorical_crossentropy',
              optimizer=slow_adam, metrics=['accuracy'])


Listing 23.39: Compiling our model using a custom object for the Adam 
optimizer. 
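

To get a feel for what the smaller learning rate does, here is a plain-Python sketch of the very first Adam update for a single weight. The update rule and the values of beta_1, beta_2, and epsilon are the commonly published Adam settings as we understand them, not something taken from this chapter, so treat them as assumptions. The function name adam_first_step is ours.

```python
import math

def adam_first_step(gradient, lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7):
    """Size of the change Adam makes to one weight on its very first update."""
    t = 1
    m = (1.0 - beta_1) * gradient          # running average of the gradient
    v = (1.0 - beta_2) * gradient ** 2     # running average of its square
    m_hat = m / (1.0 - beta_1 ** t)        # bias correction for the early steps
    v_hat = v / (1.0 - beta_2 ** t)
    return lr * m_hat / (math.sqrt(v_hat) + epsilon)

default_step = adam_first_step(gradient=2.0)             # lr left at 0.001
slow_step = adam_first_step(gradient=2.0, lr=0.0001)     # like slow_adam
print(default_step, slow_step)
```

On this first step the size of the change is almost exactly the learning rate itself, so cutting lr by a factor of 10 cuts the initial step by the same factor.
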


The Keras documentation lists all the optimizers and their instance 
names, their parameters, and all the defaults. 


The loss functions don’t take parameters, so unless we’re using a cus- 
tom function that we wrote ourselves, we usually provide a string 
naming one of the built-in functions. 


We've gone through a lot in this section, but it boils down to the one 
function call of Listing 23.38 (or the more customized version of 
Listing 23.39). Calling compile() with a loss function and optimizer 
gives Keras enough information to convert our network specification 
into real code that we can run. 


23.6.4 Model Creation Summary 


This section’s notebook continues 
Keras-Notebook-03-Model-Creation-Summary.ipynb 


We've just summited our second mountain. 


We started out with how to create a new model. We began by creating 
an empty Sequential object. Then we added a dense, or fully-con- 
nected, hidden layer that also specified the shape of the input layer. 
We finished with another dense layer that produced 10 outputs, one 
for each category. 


Then we compiled our model to turn it from blueprints into reality. We 
told Keras how to measure the loss, how to update the weights, and 
what data we'd like it to measure along the way for us. 


Putting this all together, Listing 23.40 shows how to create and compile
our model. We’ve merged everything into a little function that
returns the compiled model. This way our code can contain multiple
models, and we can pick the one we want just by calling the
appropriate function. In this summary, we’re assuming that we’ve
already run Listing 23.27, so the variables number_of_classes and
number_of_pixels are available to us (for simplicity, we’re using
them as global variables, but they could be passed in as parameters).


from keras.models import Sequential
from keras.layers import Dense

def make_one_hidden_layer_model():
    # create an empty model
    model = Sequential()

    # add a fully-connected hidden layer with #nodes = #pixels
    model.add(Dense(number_of_pixels, activation='relu',
                    input_shape=[number_of_pixels]))

    # add an output layer with softmax activation
    model.add(Dense(number_of_classes, activation='softmax'))

    # compile the model to turn it from specification to code
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model


model = make_one_hidden_layer_model() # make the model


Listing 23.40: Summarizing how to create and compile our first network. 


Combining the data-loading and pre-processing steps in Listing 23.27 
with the model creation steps in Listing 23.40 takes us from a blank
slate to a model that’s ready to learn. 


Building our model took only three lines of code. Compiling it took 
only one. And now we'll see that training the system also takes only one 
line. But as we’ve seen, each of these lines packs in a lot of information. 


Now we're ready to hand our prepared data to our compiled model
and start learning. 


Let’s start training! 


23.7 Training The Model 


Now that our data is set up for learning, and we have a model defined 
and compiled, it’s time to give the data to the model and let it learn. 


This is where a library like Keras really shines. The machine-learning
work of managing the data flow, calculating gradients with backprop,
applying weight update formulas, and the rest, is all handled for us.


In a nod to scikit-learn, where we used a routine named fit() to train 
our objects, the Keras training routine is also named fit(). This one 
function call takes our data and model, and runs the entire learning 
process for us, soup to nuts. We just call it and go get a cup of cof- 
fee, or sleep overnight, visit friends for the weekend, or take a vacation 
for a few weeks, depending on our network, data, and the computing 
resources available. For our little 2-layer model of Listing 23.40, run- 
ning on MNIST, a quick break is all that’s called for. It takes about 2-3 
seconds per epoch on a 2014 iMac, running the TensorFlow backend 
without a GPU. We'll see that we get good results after 20 epochs, so 
that’s less than a minute. 


To keep an eye on the learning process, we can ask fit() to print out 
intermediate progress after each epoch. This lets us see if things are 
going well, and potentially interrupt the process if the network isn’t 
learning. If we do let it run to completion, fit() returns an object of 
type History. This contains all the data that Keras measured after each 
epoch, such as the model’s accuracy and loss. We can use that history 
to make plots and graphs to visualize the system’s performance. 
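

The measurements live in the History object's history attribute, which is a Python dictionary with one list of per-epoch values for each quantity. The sketch below uses a plain dictionary as a stand-in so it runs without Keras. The numbers are copied from the three-epoch run reported in Listing 23.43, and the key names follow the labels printed there, which may differ in other versions of Keras.

```python
# a plain dict standing in for history.history
history_values = {
    'loss':     [0.3053, 0.1237, 0.0812],
    'acc':      [0.9141, 0.9645, 0.9763],
    'val_loss': [0.1641, 0.1044, 0.0820],
    'val_acc':  [0.9532, 0.9701, 0.9748],
}

# find the epoch (counting from 1) with the best validation accuracy
val_acc = history_values['val_acc']
best_epoch = max(range(len(val_acc)), key=lambda i: val_acc[i]) + 1
print('best validation accuracy', max(val_acc), 'in epoch', best_epoch)

# one line per epoch: the training and validation loss side by side
for epoch, (loss, val_loss) in enumerate(
        zip(history_values['loss'], history_values['val_loss']), start=1):
    print(epoch, loss, val_loss)
```

These same lists are what we would hand to a plotting library to draw loss and accuracy curves.
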


The terminology used by the documentation describing fit() deserves
a moment’s attention. 


The training data is now simply called x and y, though since they’re the 
first two arguments we don’t have to explicitly provide those names. 
The data that’s then used to evaluate the system is called the vali- 
dation data, not the test data. The reason for this is that Keras will 
evaluate our model after each epoch. Thus we hold the test data aside, 
to evaluate the final model just before deployment. We use the valida- 
tion set for measuring performance while training. 


With the terminology in place, let’s look at how we call fit(). 


The first two arguments, which are both mandatory, are the training 
samples and the training labels, in that order. As we just saw, they’re 
called x and y, though following Python convention, these mandatory 
first arguments are usually not explicitly named when we call fit(). 


During training, fit() will periodically evaluate the model using the 
validation data. We can choose to either provide that data explicitly, or 
we can tell fit() to extract a validation set from the input data. 


If we have our own validation data (as we do with MNIST), we provide 
the validation samples and their labels in a little 2-element list as the 
value of the optional argument validation_data. 


If we don’t have our own validation set, fit() can make one for us if 
we give it a value for the argument validation_split, in the form of 
a floating-point number from 0 to 1, telling it what fraction of the
training data to use as validation data. This is like using scikit-learn’s 
train_test_split() routine, but on the fly. Generally speaking, it’s 
better to provide our own validation data, since we have more control 
over what it contains. 
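

The idea behind validation_split can be sketched in a few lines of plain Python. This is our own illustration, not the Keras code. Taking the held-out samples from the end of the data, without shuffling first, reflects our understanding of how Keras behaves and should be treated as an assumption of the sketch.

```python
def split_off_validation(samples, labels, validation_split):
    """Hold out the final fraction of the data as a validation set."""
    split_at = int(len(samples) * (1.0 - validation_split))
    training = (samples[:split_at], labels[:split_at])
    validation = (samples[split_at:], labels[split_at:])
    return training, validation

samples = list(range(10))                 # stand-ins for 10 training samples
labels = [s % 2 for s in samples]         # stand-in labels

(train_x, train_y), (val_x, val_y) = split_off_validation(samples, labels, 0.2)
print(len(train_x), 'training samples,', len(val_x), 'validation samples')
```

If the data happens to be sorted, say with all of one class at the end, a split like this gives a lopsided validation set, which is one reason providing our own validation data gives us more control.
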


As we saw in Chapter 8, we typically train models in mini-batches. 
Since we rarely train with the full batch at one time, many people refer 
to mini-batches as simply “batches.” Keras does this as well, using 
parameter names like batch_size for what is more properly a “mini-batch
size.” Since using “batch” for “mini-batch” is so common, we'll
use that language here as well.


When learning in batches, fit() will pull off a batch-sized chunk of 
samples from our training set, learn from it, update the weights, and 
then take another chunk. It’s our job to tell fit() how big those 
chunks should be with the optional argument batch_size. This argu- 
ment defaults to 32, but we can set it to any value we like. If we’re 
using a GPU, we typically set this to a power of 2 (like 32 or 128) that 
makes our data fit best into the GPU we're using, so it can process an 
entire batch in one parallelized operation. When we're training on a 
CPU only, we often use a larger batch size, perhaps even a few hun- 
dred samples, since our computer has more memory available. 


In this chapter we'll be demonstrating results without a GPU, so we'll 
usually use a pretty big batch size like 256. 
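

Here is a plain-Python sketch of how a training set gets carved into batches of this size. The generator name batches is ours, and the list of integers merely stands in for the 60,000 MNIST training images.

```python
import math

def batches(samples, batch_size):
    """Yield successive batch-sized chunks; the last one may be smaller."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

training_samples = list(range(60000))     # stand-ins for the 60,000 MNIST images
batch_size = 256

batch_lengths = [len(batch) for batch in batches(training_samples, batch_size)]
print(len(batch_lengths), 'weight updates per epoch')
print('last batch holds', batch_lengths[-1], 'samples')
```

Since 60,000 isn't a multiple of 256, the final batch in each epoch is a partial one.
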


Another important argument is how many epochs the training should 
run for. Recall that one epoch means one complete pass through the 
training set (taken in batches, as above). That is almost never enough 
to train the system fully, so then the system runs through all the data 
again, for another epoch, repeating the process over and over. The 
downside of telling fit() how many epochs to use before we’ve even 
started training is that we could be wildly off. Maybe we need far more 
epochs than we ask for, so we end up stopping training too soon, or we 
pick a number far larger than we need, wasting a lot of time training a 
network that’s no longer learning (or worse, overfitting). We will see 
cures for both problems later. For now, we'll just pick a number and 
hope that it’s about right. The name of the argument is epochs, short 
for “number of epochs.” We'll pick 3 just to make sure everything’s 
working right, and then crank that number up later. 


The last argument we'll use is verbose, which tells the system how
chatty to be after each epoch (grammatically, we might prefer “verbosity”
for the name of this argument, but verbose it is). If we set this to 0
it doesn’t print out anything. A value of 1 prints an animated progress
bar that shows the system chugging its way through the samples in
each epoch. A value of 2 just shows a single summary line of text after
each epoch.


Let’s train our model with our own validation data, for 3 epochs. Since 
we're training on a CPU, we'll use a large batch size of 256 samples per
batch. This will give us smoother graphs when we plot our data, com- 
pared to the results for the smaller batch sizes we’d usually use if we 
were training on a GPU. We'll set verbose to 2 so that we'll get a line 
of information after each epoch. Listing 23.41 shows the one line that 
does it all. 


# call fit() to train the model, and save the history
history = model.fit(X_train, y_train,
                    validation_data=(X_test, y_test),
                    epochs=3, batch_size=256, verbose=2)


Listing 23.41: Finally, we're training our model! 


When we enter this line, the system will start to train. With only 3 
epochs, this should run in well under a minute on most any modern 
computer. 


This is the top of the third mountain! Let’s put it all together. 


23.8 Training and Using A Model 


This section’s notebook is 
Keras-Notebook-04-Train-and-Run.ipynb 


We’ve reached the peak of the final mountain. Starting from scratch, 
we've got a (barely) trained neural network for classifying MNIST
digits. 


This is a good time to pause, enjoy the view, and look back on how far 
we've come.


Listing 23.42 combines the pre-processing of Listing 23.27, the model
building of Listing 23.40, and the training of Listing 23.41 into one 
place. 


In a scant 53 lines, including comments and blank spaces, this code 
starts with nothing, gets our data, pre-processes it, builds and com- 
piles a deep-learning model, and then trains it for 3 epochs. 


from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense
from keras import backend as keras_backend
from keras.utils.np_utils import to_categorical
from keras.utils import np_utils
import numpy as np

random_seed = 42
np.random.seed(random_seed)


# load MNIST data and save sizes
(X_train, y_train), (X_test, y_test) = mnist.load_data()
image_height = X_train.shape[1]
image_width = X_train.shape[2]
number_of_pixels = image_height * image_width


# convert to floating-point 
X_train = keras_backend.cast_to_floatx(X_train) 
X_test = keras_backend.cast_to_floatx(X_test) 


# scale data to range [0, 1] 
X_train /= 255.0 
X_test /= 255.0 


# save the original y_train and y_test 
original_y_train = y_train 
original_y_test = y_test 


# replace label data with one-hot encoded versions 
number_of_classes = 1 + max(np.append(y_train, y_test) ) 

y_train = to_categorical(y_train, num_classes=number_of_classes) 
y_test = to_categorical(y_test, num_classes=number_of_classes) 
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# reshape samples to 2D grid, one line per image 
X_train = X_train.reshape([X_train.shape[0], number_of_pixels]) 
X_test = X_test.reshape([X_test.shape[0], number_of_pixels] ) 


def make_one_hidden_layer_model():
    model = Sequential()
    model.add(Dense(number_of_pixels, activation='relu',
                    input_shape=[number_of_pixels]))
    model.add(Dense(number_of_classes, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model


# make the model 
one_hidden_layer_model = make_one_hidden_layer_model() 


# call fit() to train the model, and save the history
one_hidden_layer_history = one_hidden_layer_model.fit(
    X_train, y_train,
    validation_data=(X_test, y_test), epochs=3,
    batch_size=256, verbose=2)


Listing 23.42: Combining the pre-processing of Listing 23.27, the model 
building of Listing 23.40, and the model training of Listing 23.41 into one 
place. 


23.8.1 Looking at the Output 


The final line of Listing 23.42 actually trains our model, learning from 
the training data in X_train. Listing 23.43 shows the summaries we 
asked it to print after each epoch. Running this code again may pro- 
duce slightly different numbers. 


Train on 60000 samples, validate on 10000 samples
Epoch 1/3
3s-loss: 0.3053 - acc: 0.9141 - val_loss: 0.1641 - val_acc: 0.9532
Epoch 2/3
3s-loss: 0.1237 - acc: 0.9645 - val_loss: 0.1044 - val_acc: 0.9701
Epoch 3/3
3s-loss: 0.0812 - acc: 0.9763 - val_loss: 0.0820 - val_acc: 0.9748


Listing 23.43: The output of Listing 23.42. Note that different backends 
may produce slightly different values. We removed the least significant 
digit in each number to make the lines fit. 


The first line is for reassurance, reporting the sizes of the training and 
testing sets. This helps us catch situations where we accidentally mix 
up these two data sets. 


The system prints out Epoch 1/3 when it starts the first epoch, and 
prints the summary line when it has run through every sample in the 
epoch. The first piece of information is the time consumed. Here, it 
took about 3 seconds (again, on a CPU only) to train our simple model 
on every sample in the training set (that is, one epoch). The system 
then prints out the loss and accuracy (loss and acc) for the training 
set. Unfortunately, these are not explicitly labeled as being for the 
training set. But we can see that the next two results are for the vali- 
dation set (val_loss and val_acc), so that helps remind us that the 
unlabeled versions are for the training data. 


How'd we do? At a glance, the results for the training data look
promising. The training loss is dropping after each epoch, and the training
accuracy is improving. This suggests that we have everything wired up sensibly,
and that the system is learning.


A moment of exuberance would not be out of place. 


The validation data also looks good. Again, the loss is dropping every 
epoch, and the accuracy is climbing. After just 3 epochs of training, it’s 
already up to over 97% accuracy! That’s not nearly as good as the best 
scores anyone’s found [LeCun13], but it’s pretty amazing that with a 
tiny network containing just one hidden layer, after only 3 epochs of
learning, and no tuning at all, we are recognizing handwritten digits 
correctly 97.5% of the time! 


Training for 3 epochs isn’t really enough for almost any network, even 
one this simple. Let’s run this for 20 epochs and see how it does. All 
we have to do is change the argument to epochs to 20 and let it go. 


history = model.fit(X_train, y_train,
    validation_data=(X_test, y_test),
    epochs=20, batch_size=256, verbose=2)


Listing 23.44: Finally, we're training our model for real! We're giving it 20 
epochs to learn. 


The first few and last few lines from the output of Listing 23.44 are
shown in Listing 23.45.


Train on 60000 samples, validate on 10000 samples 

Epoch 1/20 

3s-loss: 0.3044-acc: 0.9141-val_loss: 0.1643-val_acc: 0.9522 
Epoch 2/20 

3s-loss: 0.1239-acc: 0.9646-val_loss: 0.1037-val_acc: 0.9699 
Epoch 3/20 

3s-loss: 0.0813-acc: 0.9760-val_loss: 0.0816-val_acc: 0.9753 
Epoch 4/20 

3s-loss: 0.0578-acc: 0.9834-val_loss: 0.0723-val_acc: 0.9788 


Epoch 17/20 

3s-loss: 0.0015-acc: 1.0000-val_loss: 0.0639-val_acc: 0.9818 
Epoch 18/20 

3s-loss: 0.0011-acc: 1.0000-val_loss: 0.0628-val_acc: 0.9826
Epoch 19/20 

3s-loss: 9.0475e-04-acc: 1.0000-val_loss: 0.0633-val_acc: 0.9827 
Epoch 20/20 

3s-loss: 8.6833e-04-acc: 1.0000-val_loss: 0.0619-val_acc: 0.9826 


Listing 23.45: The start and end of the output from Listing 23.44.


The output in Listing 23.45 looks fantastic. Our score on the training
set is 100% accuracy. That’s perfection!


The testing set score is not perfect, but it’s very respectable for such a 
simple model. Our model is misclassifying only 174 out of all 10,000 
test samples. 
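

The count of 174 follows directly from the final validation accuracy in Listing 23.45. Here is a quick check of that arithmetic in Python. The variable names are our own, chosen just for this sketch.

```python
# Final validation accuracy reported for epoch 20 in Listing 23.45
val_acc = 0.9826
num_test_samples = 10000

# Every test sample is either classified correctly or it isn't,
# so the error count is the fraction wrong times the size of the set.
num_errors = round((1.0 - val_acc) * num_test_samples)
print(num_errors)
```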


As we mentioned earlier, one of the nice things about using image 
data is that we can look at it. Let’s look at some examples that were 
misclassified. 


In Figure 23.25, each row shows images whose given label is that row’s 
number. That is, every image in the top row was originally assigned 
the label 0 in the data set, every image in the second row was originally
assigned the label 1, and so on. But here we’re showing images
that were classified incorrectly by our network. The column shows the 
label that the system assigned to that image. For example, in the top 
row, there’s a picture in the fifth position. It’s in the top row, so the 
MNIST data tells us that this should be a 0, but it’s in the fifth column, 
so our system predicted that this was a 4. That seems like a pretty odd 
mistake. 


Figure 23.25: A visualization of the test-set images whose predicted
value did not match their label. The top row contains images that were
labeled 0 in the original MNIST data. The next row contains images that
were labeled 1 in the original data, and so on. Each column tells us what
label our network assigned. For example, the top row shows images that
should all have been labeled 0. But there’s an image in the 5th column of
this row, meaning that the system classified it as a 4. Empty boxes mean
no images fell into that position. The image shown in each position is
randomly selected from all the images that belong to that cell.


A more sensible mistake can be seen in the third row. The second image
from the left was labeled a 2 in the data, but our system categorized it 
as a 1. It’s hard to tell what it should be. In the sixth row, the first entry 
was labeled a 5 in the data, but our system called it a 0. That doesn’t 
seem unreasonable. 


While some of these errors seem surprising (the leftmost 8 in Figure
23.25 seems pretty clearly an 8 and not a 0), it’s important to keep in
mind that these errors are rare. Out of 10,000 test images, the system
only disagreed with the given labels 174 times.


Figure 23.26 shows a “heat map” of our errors, where each cell tells us 
how many images fell into that cell. The range runs from black (for no 
images at that cell) through reds and then yellows to white. 


Figure 23.26: A heat map of the population of the error visualization
of Figure 23.25, telling us how many images landed in each of the cells. 
Black indicates an empty list, while brighter reds, then yellows, and finally 
white represent longer lists. 


The white box in the 10th position of the 5th row shows that on this
training run, the most common mistake the system made was
with digits that had been labeled as a 4, but which the system identified
as a 9. Out of the 10,000 digits in the validation set, 9 images were
misclassified this way. Other common mistakes were digits labeled as
5 being called 3, and digits labeled 7 being classified as 3.
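

The counts behind a heat map like Figure 23.26 can be gathered with a small grid of tallies, where the row is the given label and the column is the predicted label. The following is a minimal sketch in plain Python. The function name error_grid and the tiny data set are made up for illustration and are not taken from the notebook.

```python
def error_grid(labels, predictions, num_classes=10):
    """Count misclassified samples by (given label, predicted label)."""
    grid = [[0] * num_classes for _ in range(num_classes)]
    for label, predicted in zip(labels, predictions):
        if label != predicted:          # correct answers are left out
            grid[label][predicted] += 1
    return grid

# A tiny made-up example (these are not real MNIST results)
labels      = [4, 4, 4, 5, 7, 0, 2]
predictions = [9, 9, 4, 3, 3, 0, 1]

grid = error_grid(labels, predictions)
total_errors = sum(sum(row) for row in grid)
print(grid[4][9])       # samples labeled 4 that were called 9
print(total_errors)     # all of the mistakes put together
```

Because correct predictions are skipped, the diagonal of the grid stays at zero, and each remaining cell holds the length of the list of images that would be drawn from for that position in Figure 23.25.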


23.8.2 Prediction


This section’s notebook is 
Keras-Notebook-05-MNIST-Photo-Prediction.ipynb 


Our validation data is getting some impressive numbers, but what if 
we looked at some digits that weren’t made by Census Bureau workers 
or high-school students? 


Let’s take the model we just trained, and deploy it. We'll give it some 
new images that it’s never seen before, and see how it does. 


Figure 23.27 shows four photographs taken on a winter’s day in the 
Seattle area. We have a sign in a coffee-shop window, some spray- 
painted marks on the ground near a construction site, a number 
painted onto the side of a dumpster, and a parking-lot stall number. 


Figure 23.27: Four photos from the Seattle area. Left to right: a sign in a
coffee shop window, spray-painted marks on the ground near a construction
site, a building number on the side of a dumpster, and a 3-digit parking
stall number.


The stenciled numbers in the parking lot stall are not hand-drawn, 
and they have gaps, so they’re really not appropriate for our system. 
They’re included just for fun, and to see what our deep-learning sys- 
tem comes up with. 


When we extract these digits, rotate them to be upright, and prepare 
them in the same way as the original MNIST data [LeCun13], we get 
Figure 23.28. 


Figure 23.28: Extraction of the digits from Figure 23.27. Each digit has
been rotated to be upright and then processed like the original MNIST 
images. 





How well will our system do? It’s important to emphasize before we 
get into this that this is not a fair test. The validation set has 10,000 
images for a good reason, but here we’ve only got 18 images. This is 
far too small a sample set to have any statistical validity. Even worse, 
the parking lot images are not hand-drawn, and they each have prom- 
inent gaps. So all we’re going to get here is some anecdotal evidence, 
rather than anything we can use to reliably characterize our system’s 
performance. That, after all, is exactly what the validation set is for. 
Nevertheless, they make a fun and interesting test, and along the way 
we'll see how to ask our model for predictions, so let’s dig in. 


Just to clarify which set we’re working with, let’s make four test sets,
one for each group of images. For instance, we'll arrange the coffee
shop data into a grid that has 4 rows and 784 columns, as in Figure
23.29.


Figure 23.29: Arranging the four images of our coffee shop data into a
2D grid, one row per image. 


The construction data has 7 rows, the dumpster data has 4 rows, and 
the parking lot data has 3 rows. 


As always, we need to pre-process our data. We’ve already got it into 
a 28 by 28 shape, but that’s not enough. Just like the MNIST data, we 
need to convert the input pixels into the current Keras floating-point 
form, and then we must apply the same pre-processing that we 
applied to the training data we used to train our model. So we'll use 
cast_to_floatx() again to get the data into the right type, and then 
we'll divide every pixel by 255, just as before. The processing step for 
the coffee shop data is shown in Listing 23.46. 


CoffeeShopDigits_set = keras_backend.cast_to_floatx(
    CoffeeShopDigits_set)
CoffeeShopDigits_set /= 255.0


Listing 23.46: Pre-processing of the coffee shop images. We set them to 
the current floating-point type, and then use the same pre-processing we 
used when training. Here, that’s just dividing the values by 255. 
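

To make the shape of the data concrete, here is a small sketch of the two steps together: turning each 28 by 28 image into one row of 784 values, and scaling every pixel by 255. The notebook works with arrays and cast_to_floatx(). This sketch instead uses plain nested Python lists so that it runs without any library, and the function name prepare_images and the fake images are our own inventions.

```python
def prepare_images(images):
    """Flatten each 28-by-28 image into a row of 784 floats from 0 to 1."""
    rows = []
    for image in images:
        row = [pixel / 255.0 for image_row in image for pixel in image_row]
        rows.append(row)
    return rows

# Four made-up "images": all black, all white, and two shades of gray
fake_images = [[[value] * 28 for _ in range(28)]
               for value in (0, 255, 51, 102)]

grid = prepare_images(fake_images)
print(len(grid), len(grid[0]))   # number of rows, number of columns
```

The result has one row per image, matching the layout in Figure 23.29.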


Now we’re ready to give these images to the model and ask it to identify,
or predict, each digit. We’re testing our deep-learning system on
new data!


There are two types of predictions we can ask for. The simplest one just
gives us the highest-probability predicted class for each sample. We 
get this by calling a method called predict_classes() which is pro- 
vided by our compiled model. Its first, and only mandatory, argument 
is the data we want it to predict. We'll set verbose to 0 because we 
don’t need a progress bar for this fast little task. Listing 23.47 shows 
the input and output. 


model.predict_classes(CoffeeShopDigits_set, verbose=0) 
array([4, 5, 6, 8]) 


Listing 23.47: We give our model the new data in the proper shape, and
ask for the final classes with predict_classes(). The result is an array
with one entry for each input image.


The result of the prediction is an array of integers, one for each line in 
the input, telling us what class the system has assigned to that line. 


Comparing the results of Listing 23.47 with Figure 23.28, we can see 
that the system did perfectly! It correctly classified all four digits. 


The other type of prediction we can ask for will give us the probability
of each class. This way we can see if there were any close runners-up,
and generally get a feeling for how well the system did at not just
finding the right answer, but discarding the incorrect answers. We get
back this list using the predict_proba() method. Listing 23.48 shows
the result of this function, providing us with the probabilities for each
class of the coffee-shop sign’s number 4.


Chapter 23: Keras Part 1 


coffee_probas = model.predict_proba(CoffeeShopDigits_set,
                                    verbose=0)
coffee_probas[0]
array([ 1.14440860e-13,  1.44225864e-11,  2.28489186e-10,
        5.18200795e-11,  7.40086734e-01,  1.52976892e-04,
        1.60120806e-10,  1.13515500e-06,  1.25018749e-04,
        2.59634137e-01], dtype=float32)


Listing 23.48: The probabilities for our coffee shop sign’s 4. We first ask 
for the probabilities for all the images in the set, which gives us back a list 
of length 4, each element a list of 10 probabilities for that image. Here we 
print just the first of those lists. The entry at index 4 is the largest, but 
it’s interesting that the system thought there was a pretty good chance 
the digit was a 9. 


We can see that we get back one floating-point number for each class.
In this example, the largest values were for 4 and 9, which makes sense
looking at the image. The 4 won out by a factor of nearly 3.
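

We can check these observations ourselves by ranking the probabilities from Listing 23.48. The sketch below copies those ten numbers into a plain Python list (rather than the array that Keras returns), finds the most probable class and its closest rival, and compares the two. Picking the most probable class is what gives us the answer reported for this image in Listing 23.47.

```python
# Probabilities copied from Listing 23.48 (the index is the digit class)
probas = [1.14440860e-13, 1.44225864e-11, 2.28489186e-10,
          5.18200795e-11, 7.40086734e-01, 1.52976892e-04,
          1.60120806e-10, 1.13515500e-06, 1.25018749e-04,
          2.59634137e-01]

# Rank the classes from most probable to least probable
ranked = sorted(range(len(probas)), key=lambda c: probas[c], reverse=True)
best, runner_up = ranked[0], ranked[1]
ratio = probas[best] / probas[runner_up]

print(best, runner_up)   # the winning class and its closest rival
print(ratio)             # how decisively the winner won
print(sum(probas))       # the probabilities add up to about 1
```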


Let’s plot all the probabilities for the four digits in the coffee shop data. 
The results are in Figure 23.30. Only the 4 had any appreciable com- 
petition, with there being around a 26% chance that the image was a 9. 
The system was essentially certain about the other three digits. 


Figure 23.30: Plots of the probabilities from Listing 23.48. The system
was pretty sure the first digit was a 4, but thought it might also be a 9 (or 
just possibly an 8). It was very sure of the other digits. 


Let’s test our model on the other three sets. 


Listing 23.49 shows the input and output for the construction data. 


model.predict_classes(ConstructionDigits_set, verbose=0)
array([0, 2, 3, 9, 5, 3, 9])


Listing 23.49: Our model’s predictions for the construction data. 


The system got most of the digits right, but misclassified the 4 as a 9,
and the 7 as a 3. The 7 seems pretty reasonable, since we could interpret
that cross-bar as splitting the figure into an upper and lower curve,
sort of like a 3. Mistaking the 4 for a 9 seems harder to rationalize.


Listing 23.50 shows the input and output for the dumpster data. 


model.predict_classes(DumpsterDigits_set, verbose=0) 
array([1, 3, 4, 5]) 


Listing 23.50: Our model’s predictions for the dumpster data. 


Perfect! Listing 23.51 shows the input and output for the parking lot 
data. 


model.predict_classes(StencilDigits_set, verbose=0) 
array([2, 3, 5]) 


Listing 23.51: Our model’s predictions for the parking lot data. 


Perfect again! The parking-lot data is almost ridiculously unfair. The 
digits are not hand-drawn, and they all have multiple gaps. This isn’t 
the sort of thing the model was trained on at all. There’s no reason to 
think it would interpret these images well. Yet it nailed all three digits. 


Our tiny model with just two layers, and just 20 epochs of training, did 
a great job, correctly classifying 16 out of 18 of our images. 


23.8.3 Analysis of Training History 


This section’s notebook is 
Keras-Notebook-06-MNIST-Training-History.ipynb 


Our system seems to be doing pretty well, particularly for something 
so simple. 


We mentioned before that fit() returned some history information 
that tells us how the training went. Let’s investigate that now and see 
what we can learn. 


To gather lots of data, this time we'll train for 100 epochs, even though 
we know the system hits 100% on the training data after just 20 epochs. 


The history information is returned by fit(), so we can just assign the
output of that method to a variable, as in Listing 23.52.


one_hidden_layer_history = model.fit(X_train, y_train,
    validation_data=(X_test, y_test),
    epochs=100, batch_size=256, verbose=2)


Listing 23.52: Saving the history returned by fit(). 


Here we’re saving the history in a variable named
one_hidden_layer_history. This contains a bunch of fields that
summarize the training process (like how many epochs it ran for, and 
what parameters we used). The field that’s most interesting to us right 
now is called history. It’s a Python dictionary object that contains the 
accuracy and loss values for both the training and validation sets after 
each epoch. 


The training accuracies are in this dictionary as a list stored under the
key 'acc', so we'd get them from one_hidden_layer_history with
one_hidden_layer_history.history['acc'] (that’s a lot of typing!).
The training loss uses the key 'loss'. Similarly, the validation accuracy
and loss are stored with the keys 'val_acc' and 'val_loss'.


Note that as in many other places in Keras, information related to the 
training set has no prefix, so those lists use the keys 'acc' and 'loss'.
Information related to other data sets is prefixed by a descriptor, so 
here we have validation data saved with the keys 'val_acc' and 
'val_loss'. 
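

Here is a small sketch of how we might walk through such a dictionary. To keep it self-contained, we build a stand-in dictionary by hand from the first three epochs shown in Listing 23.45, rather than pulling it from a real training run. In practice this would be one_hidden_layer_history.history, with one entry per epoch in each list.

```python
# Stand-in for one_hidden_layer_history.history, filled in by hand with
# the first three epochs from Listing 23.45
history = {
    'loss':     [0.3044, 0.1239, 0.0813],
    'acc':      [0.9141, 0.9646, 0.9760],
    'val_loss': [0.1643, 0.1037, 0.0816],
    'val_acc':  [0.9522, 0.9699, 0.9753],
}

# Step through the training and validation losses side by side
for epoch, (train_loss, val_loss) in enumerate(
        zip(history['loss'], history['val_loss']), start=1):
    gap = val_loss - train_loss
    print('epoch {}: loss {:.4f}, val_loss {:.4f}, gap {:+.4f}'.format(
        epoch, train_loss, val_loss, gap))

# The last entry in each list is the value after the final epoch
final_val_acc = history['val_acc'][-1]
print(final_val_acc)
```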


Using the list of numbers retrieved by using each of these keys, Figure 
23.31 plots the accuracy and loss of our training data graphically. 


Figure 23.31: The accuracy and loss of our one-layer network plotted
against the number of epochs. 


There are some surprises here. 


First, it’s worth noting the scales of the data. The accuracy graph begins 
at about 0.91 (and tops out at 1.0). That means that after just one epoch 
of training, our system was up to 91% accuracy. That’s far from perfect, 
but it’s pretty amazing for such a tiny network and one epoch of training.
The loss plot has a correspondingly small range, from 0 to just 0.3.


Both graphs show some spikes. This is probably due to a time when the 
samples arrived in just the right order so that some systematic errors 
were able to accumulate. The system righted itself nearly immediately 
in both cases. 


The training loss quickly drops to 0 by about the 20th epoch, and except 
for the spikes, it stays there. But the validation loss is slowly increas- 
ing. In other words, the training loss and validation loss are diverging. 
This is a picture of overfitting. As we discussed in Chapter 9, overfitting
means that the system has learned how to identify the training set
by homing in on its idiosyncrasies, not its general principles.


Learning during overfitting is actually reducing our performance on 
the validation data, as the system fruitlessly learns more and more 
about the training set, sharpening its rules and memorizing details. 
This is a complete waste of effort, and it comes at the expense of losing 
generality, with each epoch causing even more harm to the network’s 
accuracy on new data. Though it doesn’t look like the accuracy is dropping
in these graphs, the increasing validation loss suggests that that
time may come, if we keep on training.


To prevent this overfitting, we might be tempted to stop training where 
the loss or accuracy curves cross one another, but this would be too 
early. The validation accuracy is still improving, and the validation 
loss is still generally dropping. The best place to stop would be when
our validation loss or accuracy stops improving. That is, when the loss
starts to increase or the accuracy starts to drop. Of these two choices,
we usually use increasing loss on the validation set as our trigger to 
stop training. 


Below we'll see how to detect that situation automatically, and stop 
training at that point. This will help us avoid overtraining our model. 
It will also solve our problem of having to guess the right number of 
epochs to train for, and hope we don’t guess either too high or too low. 
We'll just pick a huge number, and let the system stop itself when it 
starts to overfit. 
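

The stopping rule itself is simple enough to sketch in plain Python, independent of Keras. The function below scans a list of validation losses and reports the epoch at which we would give up, once the loss has failed to beat its best value for a few epochs in a row. The function name stopping_epoch, the patience parameter, and the loss values are all made up for this illustration.

```python
def stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None.

    We stop once the validation loss has failed to improve on its
    best value so far for `patience` epochs in a row.
    """
    best = float('inf')
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Made-up validation losses that fall and then start to climb
losses = [0.16, 0.10, 0.08, 0.07, 0.072, 0.075, 0.08]
print(stopping_epoch(losses))
```

Waiting for a couple of disappointing epochs, rather than stopping at the first one, keeps a single noisy spike like those in Figure 23.31 from ending training too soon.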


Before we get into that, let’s see how to save and load our hard-won 
trained networks to a file. After all, if we've spent hours (or days or 
weeks) training a model, we’d surely like to be able to save all of those 
precious weights. Then the next time we want to use that model we 
can just load the weights from a file and our fully-trained model will be 
ready to go, and we won't have to teach it all over again from scratch. 


23.9 Saving and Loading


This section’s notebook is 
Keras-Notebook-07-Save-and-Load.ipynb 


After we've gone to all the trouble of training a model, we certainly 
want to save it so we can use it again later. There are several options. 


23.9.1 Saving Everything in One File 


The easiest way to save our model and weights is to call a built-in 
method belonging to our object that tells it to write itself to a file. The 
method is, sensibly enough, called save(). When we call this method, 
the model will write a file that contains both its architecture and 
weights. 


The model is saved in a format called HDF5, which conventionally 
uses the extensions .h5 or .hdf5 [HDF517]. We can save our model 
with just one line, as in Listing 23.53. 


model.save('my_model.h5')


Listing 23.53: Saving a complete version of our model to a file. 


Later, we can read this file back in with the load_model() function.
Unlike save(), we need to import a new Keras module to access
load_model(). That’s because when we load a model, we might not yet have
an object whose methods we can call.


Suppose we want to load the model that we saved in Listing 23.53. We 
can do it with Listing 23.54. 


from keras.models import load_model
model = load_model('my_model.h5')


Listing 23.54: Loading the complete version of our model from a file. 


Just like that, the model variable now contains a complete version of
the model we saved, with all the weights we’d learned as of the time 
the file was written. 


We can now use that model to predict new results. Because Keras also 
saves the state of the optimizer in the file, if we want to train the model 
some more, we can just pick up training from where we left off. 


23.9.2 Saving Just the Weights 


If we only want to save the weights (probably to save a little space on 
our hard drive), the method save_weights() will do the job, as in 
Listing 23.55. 


model.save_weights('my_model_weights.h5')


Listing 23.55: Saving just the weights to a file. 


If we want to use these weights later, then we have to first build a 
model to receive them. The most common case is when our model has 
the same architecture as the model we used to save the weights. Then 
the weights just pour right back in to where they had been, as shown 
in Listing 23.56. 


# create a model just like the one we saved the weights from
model = make_model()  # a pretend function to make our model

# now read the weights back from a file and fill up the model
model.load_weights('my_model_weights.h5')


Listing 23.56: Loading the weights only. 


23.9.3 Saving Just the Architecture


Saving both the model and its weights is the most convenient way to 
save our work, since we have everything we need in one place. Saving 
just the weights is useful if we want to share our trained model with 
people using different libraries that aren’t set up to read the Keras 
architecture information. 


Much less frequently we'll want to save just the architecture without 
the weights. 


If we need to save just the architecture of the model, Keras supports 
two different formats: JSON [JSON13] and YAML [YAML11]. These 
formats are both designed to save data structures to text-only files. 
YAML is a superset of JSON, meaning that it can do everything JSON
can do and more, but if we’re just saving and loading a model architecture
that extra power is moot. Both standards are text-based, so
it’s easy to open and read files in either format with a text editor if we
want.


The technique for saving an architecture in both cases is to use Keras 
to convert the model into a big character string, and then write that 
string to a file. 


To get the architecture back, we read the string from the file, and then 
use Keras to turn the string into a model. 


To turn a model into a YAML string, we use the to_yaml() method
that is part of the model. Then we can write that to a file, as in Listing
23.57.


import yaml

filename = 'my_model_arch.yaml'
yaml_string = model.to_yaml()
with open(filename, 'w') as outfile:
    yaml.dump(yaml_string, outfile)


Listing 23.57: Saving our model architecture, without weights, as a YAML 
file. 


To read our architecture back, we can use Listing 23.58.


import yaml
from keras.models import model_from_yaml

filename = 'my_model_arch.yaml'
with open(filename) as yaml_data:
    yaml_string = yaml.load(yaml_data)
model = model_from_yaml(yaml_string)


Listing 23.58: Reading our model architecture, without weights, from a 
YAML file. 


To use JSON rather than YAML, we need only replace all occurrences 
of yaml with json in Listing 23.57 and Listing 23.58. 
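

To see the JSON version of this round trip without needing a trained model on hand, the sketch below uses a short stand-in string in place of the output of model.to_json(). The stand-in is not a real Keras architecture, and the temporary folder is only there so the example cleans up after itself easily. In real use, the string read back at the end would be handed to model_from_json().

```python
import json
import os
import tempfile

# Stand-in for the string that model.to_json() would give us
# (a made-up placeholder, not a real architecture)
json_string = '{"class_name": "Sequential", "config": []}'

filename = os.path.join(tempfile.mkdtemp(), 'my_model_arch.json')

# Save: write the architecture string to a file
with open(filename, 'w') as outfile:
    json.dump(json_string, outfile)

# Load: read the string back from the file
with open(filename) as json_data:
    restored_string = json.load(json_data)

print(restored_string == json_string)
```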


23.9.4 Using Pre-Trained Models 


The ability to save and load our models is useful when we’re develop- 
ing and testing models of our own. But it also allows us to build on the 
work of others. 


Some deep learning models can have dozens of layers, and may have 
been trained for days or weeks on mountains of data that we don’t 
have access to. But if the authors of the model have released the struc- 
ture and weights, then we can instantly use their model and all the 
hard work that went into it. That’s just what we did when we used the 
VGG16 model in Chapter 20 and Chapter 21. 


We often fine-tune these pre-trained models by training them on 
our own data, helping them specialize on the tasks we need to do. This 
is sometimes called transfer learning [Karpathy16]. 


We might even modify the architecture, such as by adding a few layers
of our own to the end of the pre-trained model. We “protect” the existing
layers by telling Keras not to change their weights during training.
We say that such layers are frozen. This means that only our new layers
get updated weights as we train.


To freeze a layer, we set the layer’s optional parameter trainable to
False. We can later “thaw” a frozen layer by setting this parameter to
True and compiling the model again.


An alternative to adding more layers to the end of a model is to freeze 
all but the last few layers. We typically then train the model with our 
new data with a very small learning rate. The idea is that we’re just 
tweaking, or fine-tuning, the weights that came with those layers so 
that they’re more amenable to our data [Gupta17]. 


A list of pre-trained models in Keras can be found in the documenta- 
tion at https://keras.io/applications/. 


23.9.5 Saving the Pre-Processing Steps 


We've seen how to save the architecture, the weights, and both com- 
bined. But as we know, any time we use a model we must pre-process 
our new data in the exact same way that the training data was processed. 


For example, in our pre-processing of MNIST data in Listing 23.27, we 
divided all of our pixel data by 255. In the VGG16 model we used in 
Chapter 21, the color images used as samples must be pre-processed 
by subtracting a specific number from every channel of every pixel 
[Lorenzo17]. 


The key point is that in order to properly use a saved network, we want 
to also save and load the data pre-processing steps, so that we can 
apply them to new data. 


Unfortunately, as of Keras 2, there’s no standard way to do this. Part 
of the problem is that we can do the pre-processing any way we like. 
We might use a library function, or a function of our own, or we could 
just explicitly modify the data, the way we did when we divided it by 
255. Without some kind of standard, there’s no way to capture those 
kinds of operations. 


The general solution is to document our pre-processing steps as well
as we can. That usually means writing comments into the code or in a
text file, and then trying to make sure that the description stays with the
model somehow. We also have to figure out how to alert people that
it’s there, and encourage them to read it.


It’s a messy situation. 


But it’s a situation that we must address somehow, because we need 
to apply the same pre-processing that we used on our training data 
to any new data. Unfortunately, at the moment we must manage the 
documentation and implementation of sample pre-processing on a 
case-by-case basis. 


The important thing to remember is that whether we're sharing a 
trained model of our own, or using someone else’s, we need documen- 
tation on how the training data was pre-processed. As authors, it’s our 
job to write that documentation and make it available in some rea- 
sonable format. As adopters, it’s our job to find that information and 
follow it when preparing our own data. 
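

One way to keep such documentation close to the model is to write the steps into a small text file that travels with the saved weights. The sketch below shows one possible convention of our own. It is not a Keras feature, and the file name, the dictionary keys, and the apply_steps() function are all hypothetical.

```python
import json
import os
import tempfile

# A hypothetical description of the MNIST pre-processing in this chapter
preprocessing = {
    'input_shape': [784],
    'dtype': 'float32',
    'steps': [{'op': 'divide', 'value': 255.0}],
}

def apply_steps(pixels, steps):
    """Apply the recorded steps, in order, to a list of pixel values."""
    for step in steps:
        if step['op'] == 'divide':
            pixels = [p / step['value'] for p in pixels]
        elif step['op'] == 'subtract':
            pixels = [p - step['value'] for p in pixels]
        else:
            raise ValueError('unknown step: ' + step['op'])
    return pixels

# Store the description next to where the model file would live
path = os.path.join(tempfile.mkdtemp(), 'my_model_preprocessing.json')
with open(path, 'w') as outfile:
    json.dump(preprocessing, outfile, indent=2)

# Later, an adopter reloads it and prepares new data the same way
with open(path) as infile:
    recipe = json.load(infile)
scaled = apply_steps([0, 51, 255], recipe['steps'])
print(scaled)
```

A file like this is still only documentation. It helps only if the people using the model know that it exists and apply the steps it describes.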


23.10 Callbacks 


This section’s notebook is 
Keras-Notebook-08-Callbacks.ipynb 


Now that we can save our models, let’s get back to the issue of bringing 
our training to a halt when the validation loss starts to climb and we 
begin to overfit. 


Recall that the fit() function runs the data through one batch at a 
time, for epoch after epoch. 


After each epoch it computes values such as loss and accuracy, as well 
as the values we asked for in the metrics argument. It also consults 
a list of callback procedures that we supply. Keras then calls each of 
those procedures for us, and they can do anything we want. 


We tell Keras what functions to call by handing them to fit() as the
value of an optional argument called callbacks. These callbacks can 
be a combination of functions we’ve written ourselves, and functions 
built into Keras. 
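

The idea behind callbacks can be sketched with a toy training loop in plain Python. This is only an illustration of the concept and not how Keras is implemented. In Keras the callbacks are objects rather than the bare functions used here, and every name in this sketch is made up.

```python
def run_training(num_epochs, train_one_epoch, callbacks):
    """A toy training loop that calls each callback after every epoch."""
    for epoch in range(num_epochs):
        logs = train_one_epoch(epoch)   # a dictionary of measurements
        for callback in callbacks:
            callback(epoch, logs)

# A pretend epoch of training that just returns made-up numbers
def fake_epoch(epoch):
    return {'loss': 1.0 / (epoch + 1), 'val_loss': 1.2 / (epoch + 1)}

seen = []

def record_val_loss(epoch, logs):
    # Keep a running list of the validation loss after each epoch
    seen.append((epoch, round(logs['val_loss'], 3)))

def announce(epoch, logs):
    print('finished epoch', epoch, 'with loss', round(logs['loss'], 3))

run_training(3, fake_epoch, callbacks=[record_val_loss, announce])
```

The loop knows nothing about what the callbacks do. It simply hands each one the epoch number and the latest measurements, which is what lets us mix our own procedures with built-in ones.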


In this section, we'll focus on three of the callbacks provided to us by 
Keras: one to checkpoint (or save the weights), one to control the 
learning rate over time, and one to perform early stopping (or 
cease training when we appear to start overfitting). 


23.10.1 Checkpoints 


A popular use for callbacks is to checkpoint our model during train- 
ing. This means saving out the model (or, if we prefer, just the weights) 
to a file. We can save a checkpoint after every epoch if we like, but usu- 
ally we only do this after every few epochs. 


Having checkpoints means that if we’re training a system that takes 
hours or days, and we lose power or for any other reason the training 
stops, we can pick up again by loading the most recently saved model 
file. 


To tell Keras to make checkpoints we'll create a ModelCheckpoint 
object, and then hand it to Keras when we call fit(). 


The first argument to ModelCheckpoint, which is mandatory and 
unnamed, is the path to the file that will be written. This file is in the 
HDF5 format, so we typically give it an extension of either .h5 or .hdf5. 


This filename is special, because it can include Python string-formatting
instructions that include values for variables that Keras knows
about. It always keeps track of the epoch, so a string like {epoch:03d}
means that the braces and everything between them will be replaced
by a 3-digit decimal number holding the current epoch.


We can tell Keras to include one other value of our choice. By default,
that value is val_loss, or the loss on the validation set. So to include
that value in the name of the output file, we can use a string like
{val_loss:0.3f}. In this case the fragment will be replaced with the
current loss written with 3 digits after the decimal point (when that
value is less than 1, Python inserts a 0. at the start for us).


A typical filename is in Listing 23.59. Here we're placing our files into 
a pre-existing folder named SavedModels. 


filename = 'SavedModels/weights-epoch-{epoch:03d}-val_loss-{val_loss:.03f}.h5'


Listing 23.59: A filename that Keras will use for checkpointing. It will 
include the epoch number and validation loss with the given format when 
the file is created. 


This will create checkpoints with names like those in Listing 23.60. 


weights-epoch-000-val_loss-0.156.h5 
weights-epoch-001-val_loss-0.102.h5 
weights-epoch-002-val_loss-0.080.h5 
weights-epoch-003-val_loss-0.072.h5 


Listing 23.60: Names of the first few checkpoint files written out using 
the file name of Listing 23.59. 
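

Because the placeholders follow Python's ordinary string-formatting rules, we can preview the file names before training anything. In this sketch the template is written to reproduce the names in Listing 23.60 (without the folder), and the validation losses are made-up numbers chosen to round to the values shown there.

```python
# Template written to reproduce the names in Listing 23.60
template = 'weights-epoch-{epoch:03d}-val_loss-{val_loss:.03f}.h5'

# Made-up validation losses, one per epoch
val_losses = [0.1564, 0.1021, 0.0803, 0.0722]

names = [template.format(epoch=epoch, val_loss=loss)
         for epoch, loss in enumerate(val_losses)]
for name in names:
    print(name)
```

The 03d pads the epoch with zeros to 3 digits, which keeps the files in training order when they are sorted by name.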


To get Keras to make these files, we need to create the function that 
builds and saves them. We do this by making an instance of the built-in 
ModelCheckpoint object. It takes one mandatory argument providing 
the filename, formatted as we just saw. Listing 23.61 shows how we 
build this object, leaving all of its other options at their default values. 


from keras.callbacks import ModelCheckpoint
checkpointer = ModelCheckpoint(filename)


Listing 23.61: Creating an instance of ModelCheckpoint with our 
desired filename. 


The only thing left is to provide this object to Keras when it trains
the model. In our call to fit(), we include the optional argument
callbacks, which expects a list of callback objects. Since we just have
this one, we'll wrap it up in square brackets to make a list that’s just
one element long. Using our call to fit() from the previous section,
our call using checkpointing is shown in Listing 23.62.


history = model.fit(X_train, y_train,
    validation_data=(X_test, y_test),
    epochs=100, batch_size=256, verbose=2,
    callbacks=[checkpointer])


Listing 23.62: Calling fit() with our single-element list of callbacks. 


ModelCheckpoint has several options that give us more control over 
what gets saved, and when. 


Writing out the complete model after every epoch may take up more 
disk space (and computer time) than we want to use. We can cut down 
on the size of the file by saving just the weights. To do this, set the 
optional argument save_weights_only to True (the default is False, 
so every file contains both the architecture and the weights). 


We might not need even the weights written out after every epoch. We 
can tell it to write out a file only periodically by setting the optional 
argument period to some value (the default is 1, meaning the file is 
written after every epoch). For example, if we set period to 5, then the 
file is only produced every 5th epoch. 


By default, the value that Keras can insert into the file name is the 
validation loss, val_loss. But we can ask it to use the validation accuracy 
val_acc, the training loss loss, or the training accuracy acc. We just use 
the name we want in the checkpoint file name. 


For example, we can save the training accuracy by setting up the file- 
name as in Listing 23.63. 


filename = 'SavedModels/model-weights-' 
filename += 'epoch-{epoch:03d}-acc-{acc:0.3f}.h5' 

checkpointer = ModelCheckpoint(filename, monitor='acc', 
                               save_weights_only=True, 
                               period=10) 


Listing 23.63: Saving the training accuracy in the checkpoint file name. 


When we run the code, we'll get filenames like those in Listing 23.64. 


model-weights-epoch-009-acc-1.000.h5 
model-weights-epoch-019-acc-1.000.h5 
model-weights-epoch-029-acc-1.000.h5 


Listing 23.64: File names created by Listing 23.63. 


We can easily accumulate a lot of these checkpoint files in a long train- 
ing run. We can tell ModelCheckpoint to only write a new file if some 
measurement is better than that in any previously-saved version. We 
do this by setting two parameters. 


First, we tell it that we want this mode by setting save_best_only to 
True (the default is False). 


Second, we tell it which parameter it should use to determine if this 
epoch’s results are “better” than any that have already been saved. As 
usual we can choose from the training accuracy 'acc', training loss 
'loss', validation accuracy 'val_acc', and validation loss 'val_loss'. 
We pass the variable we want it to keep track of using the optional 
parameter monitor (the default is 'val_loss'). 


The system knows that the best loss is the smallest one, and the best 
accuracy is the largest one. 


For example, to only write a new file if it has better validation accuracy 
than any that have come before, we could use Listing 23.65. 


checkpointer = ModelCheckpoint(filename, 
                               save_best_only=True, 
                               monitor='val_acc') 


Listing 23.65: Saving only the checkpointing file with the best validation 
accuracy. 
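

To get a feel for which epochs produce a file when we save only the best results, here is a small stand-alone sketch of the rule. It is not the code Keras uses. The function name epochs_that_save and the accuracy and loss values are invented for illustration, and we tell the function directly whether bigger or smaller values count as better.

```python
def epochs_that_save(values, bigger_is_better):
    """Return the epochs at which a save-best-only rule would write a file."""
    saved = []
    best = None
    for epoch, value in enumerate(values):
        if best is None:
            improved = True              # the first epoch is always the best so far
        elif bigger_is_better:
            improved = value > best      # accuracy: larger is better
        else:
            improved = value < best      # loss: smaller is better
        if improved:
            best = value
            saved.append(epoch)
    return saved

# Made-up validation accuracies for six epochs.
val_acc = [0.91, 0.93, 0.92, 0.9353, 0.9354, 0.93]
print(epochs_that_save(val_acc, bigger_is_better=True))

# Made-up validation losses for four epochs.
val_loss = [0.30, 0.20, 0.25, 0.15]
print(epochs_that_save(val_loss, bigger_is_better=False))
```

In the accuracy example, files would be written at epochs 0, 1, 3, and 4. Epochs 2 and 5 are skipped because they don't beat the best value seen so far.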


When the training is complete, the most recently-written file will be 
the one corresponding to the best value of the validation accuracy over 
the whole training run. Note that since we’re only printing three digits 
of the value, it might not be obvious that the accuracy has improved. 
For example, if it goes from 0.9353 to 0.9354, both files will list the 
accuracy in the file name as 0.935. By looking at the time stamps of 
the files, we can infer that the more recently written one is better. 


23.10.2 Learning Rate 


Another popular use of callbacks is to change the learning rate over 
time. As we saw in Chapter 19, many modern optimizers automatically 
adjust the learning rate adaptively (they usually have names that begin 
with “Ada” for “adaptive learning rate”). But if we choose to use some- 
thing like SGD, then we'll need to manage the learning rate ourselves. 


In Chapter 19 we saw a variety of strategies for adjusting the learning 
rate over time. For example, we might start with a large learning rate, 
and then shrink it either on every epoch, or in stair step fashion after 
each group of some fixed number of epochs. To pull off these strategies, 
or any others we might prefer, we use the built-in callback routine 
named LearningRateScheduler(). 


The LearningRateScheduler callback is really just a little connection 
function between Keras and a function that we write. At the start of 
each epoch, the LearningRateScheduler calls our function and hands 
the value that our function returns to the optimizer as the new learning 
rate. The function we write must take one argument: an integer with the 
number of the epoch that is about to begin (this starts at 0). It must 
return a new floating-point learning rate as an output. 


Listing 23.66 shows the idea. We start by compiling the model with the 
non-adaptive SGD optimizer. We’ve written a little scheduling routine 
that we’ve called simpleSchedule(). If we wanted to use checkpoint- 
ing here as well, we could create a ModelCheckpoint object as in the 
last section, and include it in the list we provide to callbacks. The 
order in which the callback routines are named in this list makes no 
difference. 


from keras.callbacks import LearningRateScheduler 
from keras.optimizers import SGD 

# make the model 
model = make_model() 

# the learning rate set here is replaced by the scheduler on every epoch 
sgd = SGD(lr=0.0, momentum=0.9, decay=0.0, nesterov=False) 
model.compile(loss='categorical_crossentropy', 
              optimizer=sgd, metrics=['accuracy']) 

def simpleSchedule(epoch_number): 
    # start at 1 and drop to 0.1 
    return max(.1, 1-(0.01*epoch_number)) 

lr_scheduler = LearningRateScheduler(simpleSchedule) 
history = model.fit(X_train, y_train, 
                    validation_data=(X_test, y_test), 
                    epochs=100, batch_size=256, verbose=2, 
                    callbacks=[lr_scheduler]) 


Listing 23.66: Setting up and using a learning-rate scheduler. 
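

The stair-step strategy mentioned above can be written as a scheduling function of the same shape. This is just a sketch. The name stair_step_schedule, the starting rate of 0.1, the step length of 10 epochs, and the choice to halve the rate at each step are all assumptions made for illustration.

```python
def stair_step_schedule(epoch_number):
    """Start at 0.1 and cut the learning rate in half every 10 epochs."""
    starting_rate = 0.1      # assumed starting learning rate
    epochs_per_step = 10     # assumed length of each stair step
    steps_taken = epoch_number // epochs_per_step
    return starting_rate * (0.5 ** steps_taken)

# Print the rate at a few epochs to see the stair steps.
for epoch in [0, 9, 10, 25, 40]:
    print(epoch, stair_step_schedule(epoch))
```

Because this function takes an epoch number and returns a floating-point value, it could be handed to LearningRateScheduler in place of simpleSchedule().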


23.10.3 Early Stopping 


Another popular use of callbacks is to implement early stopping. 
Recall from Chapter 9 that this involves watching the performance of 
our network and looking for signs of overfitting. When we see overfit- 
ting, we stop training. 


So we stop “early” in the sense that we probably would have kept going 
if not for this intervention, but in fact we’re stopping at the right time 
to prevent overfitting. 


The built-in routine provided by Keras implements this idea by mon- 
itoring a statistic of our choice. When that value stops improving, it 
stops training. 


Early stopping is often used with checkpointing. We might tell our sys- 
tem to train for a ridiculous number of epochs, like 100,000 of them, 
and then go to lunch (or to sleep), leaving the computer to run, check- 
pointing the model every few epochs (or saving the best one according 
to some measurement). We count on the early stopping callback to stop 
training when our monitored statistic stops improving. Then when we 
return to the computer, we look through our saved files. Since the most 
recently-written file is usually the best-trained model, that’s the one 
we use from then on. 


Our callback is made by creating an instance of an EarlyStopping 
object. Let’s look at four of its useful options. 


First, we tell the system which value it should be watching. As usual, 
we can specify the training accuracy 'acc', training loss 'loss', 
validation accuracy 'val_acc', or validation loss 'val_loss'. We hand 
our choice to the parameter named monitor. 


Second, we provide a value to a floating-point parameter called 
min_delta. The word “delta” refers to the Greek letter δ (delta), which 
mathematicians often use to refer to the idea of “change.” In this case, 
min_delta is the minimum amount of change to the monitored value 
for EarlyStopping() to notice. Any change less than this amount is 
ignored. By default, this value is 0, so every time the monitored value 
changes, EarlyStopping() checks to see if we need to stop. That 
default is usually a good place to start. We can increase this value if 
we’re getting way too many files. 


Third, we provide a value to an integer called patience. As the system 
watches our chosen parameter from one epoch to the next, there 
might be some ranges of time where it doesn’t improve, or even gets a 
little worse. We don’t want to give up as soon as this happens, because 
it might just be a temporary effect. As we’ve seen, the accuracy and 
loss curves are often a bit noisy and jump around a little. We only want 
to call a halt if the value we’re watching is really getting worse over 
the long term. The value we assign to patience tells the routine how 
long the “long term” is. It’s the number of epochs to wait for things to 
get better before deciding that fit() should stop training. The default 
value of patience is 0, which is usually too aggressive. This is a parameter 
that’s best set after a bit of experimentation to see how noisy the 
results are. 


Finally, we can also set a value to verbose to have it print out a line 
of text if it decides to stop training, so we can look at the output and 
know that it intervened. 
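

The way min_delta and patience work together can be sketched in a few lines of plain Python. This is a simplified illustration of the idea for a value where smaller is better, such as the validation loss. It is not the code inside Keras, and the function name early_stop_epoch and the loss values are made up.

```python
def early_stop_epoch(losses, patience=10, min_delta=0.0):
    """Return the epoch at which training would halt, or None if it never does."""
    best = float('inf')
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses):
        if loss < best - min_delta:
            # An improvement big enough to notice: remember it and reset the count.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Made-up losses that improve through epoch 13 and then get worse.
losses = [1.0 / (epoch + 1) for epoch in range(14)] + [0.5] * 20
print(early_stop_epoch(losses, patience=10))
```

With these made-up numbers, the last improvement is at epoch 13, so a patience of 10 halts training at epoch 23.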


Listing 23.67 shows how to set up and use this callback. We'll watch 
the validation loss, set patience to 10 epochs, and verbose to 1 so we 
get a notice when EarlyStopping() decides we should indeed stop. 


from keras.callbacks import EarlyStopping 

early_stopper = EarlyStopping(monitor='val_loss', 
                              patience=10, verbose=1) 
history = model.fit(X_train, y_train, 
                    validation_data=(X_test, y_test), 
                    epochs=100, batch_size=256, verbose=2, 
                    callbacks=[early_stopper]) 


Listing 23.67: Setting up and using EarlyStopping() to stop training 
when the validation loss stops dropping for more than 10 epochs. 


Let’s run this and see what happens. 


Listing 23.68 shows the result. At epoch 23 we see that EarlyStopping() 
has decided we need to stop. Since we set patience to 10, and we’re 
monitoring the validation loss, this tells us that the validation loss 
hasn’t improved since epoch 13. So training ends after the 23rd epoch 
and fit() returns. It’s just as if we'd interrupted the training process 
ourselves. 


Epoch 22/100 
3s - loss: 5.8155e-04 - acc: 1.0000 - val_loss: 0.0636 - val_acc: 0.9834 
Epoch 23/100 
3s - loss: 4.4813e-04 - acc: 1.0000 - val_loss: 0.0631 - val_acc: 0.9825 
Epoch 24/100 
3s - loss: 4.0089e-04 - acc: 1.0000 - val_loss: 0.0647 - val_acc: 0.9828 
Epoch 00023: early stopping 


Listing 23.68: The last few epochs of training with early stopping. At 
epoch 23 the system decides we ought to stop training. 


The accuracy and loss graphs for this run are shown in Figure 23.32. 


Figure 23.32: Accuracy and loss for our early-stopping run. Note that 
the validation loss settles down at about epoch 13. The early stopping 
routine we set up waits another 10 epochs for improvement, and then 
halts training at epoch 23. 


The validation loss that we’re monitoring seems to stop improving at 
around epoch 13. We can be sure of that, because our early stopping 
callback halted training 10 epochs later, at epoch 23. The validation 
loss might be starting to rise just a little bit, but we’ve definitely avoided 
the rising slope of the overfitting curve we saw in Figure 23.31 when 
we trained for 100 epochs. 


Experimenting with the patience value allows us to tune the performance 
of the EarlyStopping() routine to our network and data. As 
we mentioned above, we can always write an early stopping algorithm of 
our own and use that instead [ZFTurbo16]. 


Early stopping is the solution we promised earlier to the problem of 
picking the wrong value for epochs when calling fit(). With early 
stopping in place, we can always pick a ridiculously large number for 
epochs, and let the computer automatically stop training at the right 
time. 


References 


[Benenson16] Rodrigo Benenson, “What is the class of this image?”, 
2016. http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html 


[Bengio17] Yoshua Bengio, “MILA and the future of Theano”, 
email thread on theano-users Google Group, September 
28, 2017. https://groups.google.com/forum/#!topic/theano-users/7Poq8BZutbY 


[Chollet17a] Francois Chollet, “Keras Documentation”, 2017. 
https://keras.io/ and https://github.com/fchollet/keras 


[Chollet17b] Francois Chollet, “Deep Learning with Python”, Manning 
Publications, 2017. 


[Chomsky57] Noam Chomsky, “Syntactic Structures”, Mouton & Co., 
1957. 


[CNTK17] Microsoft, “The Microsoft Cognitive Toolkit”, Microsoft 
Cognitive Toolkit, 2017. https://docs.microsoft.com/en-us/cognitive-toolkit/index 


Chapter 23: Keras Part 1 


[Devlin16] Josh Devlin, “28 Jupyter Notebook tips, tricks, and shortcuts”, 
Dataquest.io, 2016. 
https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/ 


[Dijkstra82] Edsger W. Dijkstra, “Why Numbering Should Start at 0”, 
August, 1982. https://www.cs.utexas.edu/users/EWD/ewd08xx/EWD831.PDF 


[Lorenzo17] Baraldi Lorenzo, “VGG-16 pre-trained model 
for Keras”, Github, 2017. 
https://gist.github.com/baraldilorenzo/07d7802847aaad0a35d3 


[Gupta17] Dishashree Gupta, “Transfer Learning and The Art of Using 
Pre-trained Models in Deep Learning,” Analytics Vidhya 
blog, 2017. https://www.analyticsvidhya.com/blog/2017/06/transfer-learning-the-art-of-fine-tuning-a-pre-trained-model/ 


[HDF517] The HDF5 Group, “What is HDF5?”, HDF Group Support 
Page, 2017. https://support.hdfgroup.org/HDF5/whatishdf5.html 


[JetBrains17] JetBrains, “PyCharm Community Edition IDE”, 2017. 
https://www.jetbrains.com/pycharm/ 


[JSON13] JSON Contributors, “Introducing JSON”, ECMA-404 JSON 
Data Interchange Standard Working Group, 2013. 
https://www.json.org 


[Jupyter16] The Jupyter team, 2016. http://jupyter.org/ 


[Karpathy16] Andrej Karpathy, “Transfer Learning”, Stanford CS 231 
Course Notes, 2016. https://cs231n.github.io/transfer-learning/ 


[Kernighan78] Brian W. Kernighan and Dennis M. Ritchie, “The C 
Programming Language (1st ed.)”, Prentice Hall, 1978. 


[Khronos17] The Khronos Group, “The open standard for parallel 
programming of heterogeneous systems”, Khronos Group 
Website, 2017. https://www.khronos.org/opencl/ 


[LeCun13] Yann LeCun, Corinna Cortes, Christopher J.C. Burges, 
“The MNIST Database of Handwritten Digits”, 2013. 
http://yann.lecun.com/exdb/mnist/ 


[NVIDIA17] NVIDIA Corp, “CUDA Home Page”, NVIDIA Website, 
2017. http://www.nvidia.com/object/cuda_home_new.html 


[Ramalho16] Luciano Ramalho, “Fluent Python: Clear, Concise, and 
Effective Programming”, O’Reilly Books, 2016. 


[Sato17] Kaz Sato, Cliff Young, and David Patterson, “An 
in-depth look at Google’s first Tensor Processing Unit 
(TPU)”, Google Cloud Big Data and Machine Learning 
Blog, 2017. https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu 


[TensorFlow16] Martin Abadi, Ashish Agarwal, Paul Barham, Eugene 
Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy 
Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian 
Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, 
Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath 
Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry 
Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon 
Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul 
Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda 
Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, 
Martin Wicke, Yuan Yu, and Xiaoqiang Zheng, “TensorFlow: 
Large-scale machine learning on heterogeneous systems”, 
2016. http://tensorflow.org 


[Theano16] Theano Development Team, “Theano: A Python framework 
for fast computation of mathematical expressions”, 
https://arxiv.org/abs/1605.02688 For online documentation, see 
http://deeplearning.net/software/theano/index.html 


[Wentworth12] Peter Wentworth, Jeffrey Elkner, Allen B. Downey 
and Chris Meyers, “How to Think Like a Computer Scientist: 
Learning with Python 3”, “Chapter 9: Tuples”, 2012. 
http://openbookproject.net/thinkcs/python/english3e/tuples.html 


[Wikipedia17] Wikipedia authors, “Iris Flower Data Set,” Wikipedia, 
2017. https://en.wikipedia.org/wiki/Iris_flower_data_set 


[YAML11] YAML Contributors, “YAML Home Page”, 2017. 
http://yaml.org/ 


[ZFTurbo16] ZFTurbo, “How to tell Keras stop training 
based on loss value?”, Stack Overflow, 2016. 
http://stackoverflow.com/questions/37293642/how-to-tell-keras-stop-training-based-on-loss-value 


Image Credits 


Figure 23.1, Giraffe 
https://pixabay.com/en/giraffe-wildlife-safari-africa-2868936 


Chapter 24 


Keras Part 2 


We'll expand our discussion of the Keras library to include model 
improvement, parameter searching, and how to build CNN and RNN models. 


24.1 Why This Chapter Is Here 


In Chapter 23 we introduced the Keras library and looked at how to 
build and train basic models. 


Now we'll expand our horizons. We'll see how to improve our models, 
incorporate search routines from the scikit-learn library (discussed in 
Chapter 15), and build more complex models such as CNNs and RNNs. 


24.2 Improving the Model 


This section’s notebook is 
Keras-Notebook-09-Improving-the-Model.ipynb 


We’ve explored a lot of the features Keras offers using our tiny 2-layer 
model. As we just saw, after only about 20 epochs of training this 
model was able to accurately classify about 98% of the images in the 
MNIST test set. 


Let’s see if we can improve that. How might we build a better model? 


The answer is not obvious. It’s made more difficult by the sheer number 
of choices that we can try out. Though we breezed by many of these 
choices earlier in Chapter 23, even in this extremely simple model with 
one hidden dense layer, we’ve made many choices, all of which influence 
how well and how fast our network learns. 


Though sometimes a change to our model can bring about a big 
improvement in accuracy, much of the time improving a model’s per- 
formance is a game of accumulating a sequence of tiny improvements. 


24.2.1 Counting Up Hyperparameters 


Before we start modifying the hyperparameters of our model, let’s get 
a clearer picture of just how many choices we’ve made. Note that we’re 
not counting up the weights, which aren’t under our control. We’re 
just looking at all the places where we could make a different decision 
in the design of our model. 


Many of the routines we’ve used take multiple optional arguments that 
we’ve ignored. In a sense, we’ve chosen values for those arguments by 
letting them stay at their defaults. So let’s count those, too. 


Each of our two Dense layers took 2 arguments: the number of neu- 
rons, and the activation function. Consulting the Keras documentation, 
there are at least 7 more arguments that we could reasonably experi- 
ment with. 


Then there are the choices we made when we compiled the model. We 
chose a loss function and an optimizer, giving us 2 more options to 
adjust. 


Given that we chose the adam optimizer, there are 5 optional argu- 
ments that we can use to tune its behavior. 


Finally, we supply a host of options when we call fit() to train the 
model. We have choices for the batch size and the number of epochs 
(we could argue that the number of epochs doesn’t matter if we use 
early stopping, but then we’d have to set a value for that algorithm’s 
patience). So we have at least 2 arguments here. 


So in this casual tour of our choices, we’ve got 9 layer-level choices on 
each of 2 layers for a total of 18 choices, 2 choices when we compile, 
5 more choices for our optimizer, and at least 2 choices when we fit. 
That’s a total of 27 hyperparameters for this tiny model. 


The number goes up fast as we add more layers. 
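

To see how fast, we can turn the tally above into a small calculation. This sketch assumes that every layer we add is a Dense layer with the same 9 choices we counted, and that the compile, optimizer, and fit choices stay the same. The function name count_hyperparameters is made up for illustration.

```python
def count_hyperparameters(number_of_dense_layers):
    per_layer = 2 + 7        # neurons and activation, plus 7 optional arguments
    compile_choices = 2      # loss function and optimizer
    optimizer_options = 5    # optional arguments to the adam optimizer
    fit_choices = 2          # batch size and number of epochs
    return (per_layer * number_of_dense_layers
            + compile_choices + optimizer_options + fit_choices)

# Print the total for a few network depths.
for layers in [2, 10, 50]:
    print(layers, count_hyperparameters(layers))
```

Our 2-layer model gives the 27 we just counted. Under the same assumptions, 10 layers would give 99 choices, and 50 layers would give 459.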


Figuring out what changes to these choices will make the model better 
is a daunting task. Imagine sitting at a control panel with 27 sliders, 
switches, and knobs. This is just to control basic performance, and 
doesn’t include controls for additional options like adjusting the learn- 
ing rate schedule. 


We might set the controls, push the big red button to train the net- 
work, wait for a while, and eventually look at the numbers that report 
how we did. Then we could adjust one or more controls in the hopes of 
making things better, and repeat. 


Complicating the problem is that many of the hyperparameters inter- 
act. So if we increase one value, we might only see an improvement 
if we simultaneously decrease two or three other values, and increase 
one or two others. 


Things get even harder when we want to improve larger and deeper 
models with dozens of layers. The number of choices and their possi- 
ble settings becomes enormous. 


This is why we’ve gone through so many chapters of information to 
get here. The only chance we have of improving our model’s perfor- 
mance is to draw from our knowledge about what the network is doing, 
and why, and what all of our choices do. When we understand what’s 
happening inside, we have a fighting chance of learning from our expe- 
rience and developing the intuition and hunches that are essential to 
building great deep-learning networks. 


Though we almost always have to run experiments and see what hap- 
pens, our knowledge and experience improve our chances of making 
things better. 


24.2.2 Changing One Hyperparameter 


A frequent rule in experimentation of all sorts is to change only one 
thing at a time and see what happens. This is a good plan if the values 
involved are largely decoupled, meaning that they don’t affect one 
another. As an analogy, suppose we’re adjusting the sound of our car 
radio, and boosting or cutting the highs and lows. The results of these 
choices combine, but they’re independent: generally speaking, add- 
ing more treble doesn’t change how much bass sound is delivered, and 
vice-versa. 


Unfortunately, the hyperparameters of most real systems, and most 
deep-learning systems, are not decoupled. If we increase the amount 
of hyperparameter A and find things get better, and then increase 
the amount of hyperparameter B, we may find that we now have to 
decrease the value in A to make further progress. The connections are 
complex. 


But still, changing one hyperparameter at a time is usually a good way 
to start. We can explore what that value does, find a good value for it, 
and then choose another hyperparameter to adjust, and so on, search- 
ing for a good combination by fine-tuning one hyperparameter at a 
time. If we have to go back, then our experience with each value can 
help guide us to select which one to adjust again, and by how much. 
We can also build up a sense of which values are related to which others, 
so we can anticipate their interactions. 


Let’s try that now, arbitrarily picking the batch size as our first hyper- 
parameter to experiment with. We said above that when we’re using a 
GPU we pick a batch size that best fits our particular hardware. But on 
a CPU we can pick almost any value we like. We’ve been using a batch 
size of 256, but like most of our initial choices for each of the 27 hyper- 
parameters we just counted up, it was really just a shot in the dark. 
Let’s try cranking that up and down and see what happens, if anything. 


Figure 24.1, Figure 24.2, Figure 24.3, and Figure 24.4 show the results 
of setting this hyperparameter to 2048, 512, 64, and finally 8. We used 
the same code for every run, changing only the batch size. Note that 
the vertical scale on the graphs is not the same from one graph to the 
next. This allows us to show all the data, though it means we can’t com- 
pare them equally at a glance. 


Figure 24.1: Training our two-layer model with a batch size of 2048. 


Figure 24.2: Training our two-layer model with a batch size of 512. 


Figure 24.3: Training our two-layer model with a batch size of 64. 


Figure 24.4: Training our two-layer model with a batch size of 8. 


Three things jump out from these figures. 


First, as the batch size gets smaller, the results get more jittery, or noisy. 
This is because each new update is working with fewer samples, so it’s 
responding to whatever happens to be in that batch. Larger batches 
tend to become more representative of the dataset as a whole, and give 
us smoother results. Smaller batches give us a lot of jumping around. 


The second thing is that the training accuracy is about 98% on all the 
models, so the batch size didn’t affect that accuracy very much. 


The third thing is that although all of the models are overfitting, as 
demonstrated by the diverging training and validation losses, as the 
batch size gets smaller the divergence of the training and validation 
error increases. In other words, the amount of overfitting increases. 


Smaller batches mean that epochs take longer, because we need to 
perform backprop and update the weights more frequently. Figure 
24.5 shows the clock time, in seconds, for each of the above batch sizes, 
plus the other powers of 2 between them. 


Figure 24.5: The running time in seconds (on a late 2014 iMac) of the 
experiments whose data are shown in Figure 24.1 through Figure 24.4, 
as well as other intermediate batch sizes. Note that 
the vertical scale is linear while the horizontal scale is not. 


The curve in Figure 24.5 confirms that on these CPU-only runs, as the 
batch size went down, we ran more backprop and update steps, 
so the total training time went up. 


The above experiments tell us a lot about how the batch size affects 
training for this dataset on this model. They suggest that for this model 
and data, large batch sizes are more desirable than small ones. 
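

We can estimate how many more update steps the smaller batches require with a little arithmetic. This sketch assumes a training set of 60,000 samples, which is the size of the standard MNIST training set. If our training set is a different size, the numbers will change. The function name updates_per_epoch is made up for illustration.

```python
import math

training_samples = 60000   # assumed size of the MNIST training set

def updates_per_epoch(batch_size, samples=training_samples):
    # One weight update per batch; the last batch may be only partly full.
    return math.ceil(samples / batch_size)

# Print the number of updates for each batch size we tried, plus 256.
for batch_size in [2048, 512, 256, 64, 8]:
    print(batch_size, updates_per_epoch(batch_size))
```

Under this assumption, a batch size of 2048 needs only 30 updates per epoch, while a batch size of 8 needs 7500.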


24.2.3 Other Ways to Improve 


When we seek to improve a model, it can help to keep in mind that 
we'll be unlikely to find the very best set of parameters for the training 
speed and accuracy that we’re after. Instead, we look for parameters 
that come close enough. 


It also helps to have one goal in mind at every step in the search. We 
might be looking to reduce overfitting, or drive down the test loss, or 
increase the test accuracy, or speed up training time, or fit best onto 
the GPU, or use the least computer memory, and so on. We're unlikely 
to be able to improve all of these at once. 


So we typically pick out just one or two things to improve, and then 
modify some of our variables until they’re as good as we can get them. 
Then we move on to another group of criteria, and look for the vari- 
ables that will help with those, and so on. 


For the MNIST problem, let’s aim to improve accuracy while reducing 
overfitting. 


Rather than continue to adjust hyperparameters, let’s try something 
radically different: adding a second Dense layer. 


In order to keep everything comparable with the models earlier in this 
chapter, we'll return to a batch size of 256, and leave out early stopping. 


24.2.4 Adding Another Dense Layer 


Let’s add a second dense hidden layer, just as big as the first. After all, 
having more neurons means more ability to learn, right? 


Not really. We’ve seen that our single-layer model is already too capa- 
ble of learning the idiosyncrasies in the training data. That’s why it’s 
overfitting. If we throw in yet more neurons without making any other 
structural changes, then this overfitting should get worse, faster. 


Let’s try it and see. 


Figure 24.6 shows the architecture for a model with two dense hidden 
layers, each with as many neurons as there are input elements (that 
is, each layer has 784 neurons), followed by a 10-neuron output layer 
with softmax on the output. 


[Diagram: 784 neurons (ReLU), then 784 neurons (ReLU), then 10 neurons (softmax)] 


Figure 24.6: The architecture of our three-layer model. Each of the hidden 
layers is a dense layer with 784 neurons, one for each input. Though our 
convention is that dense layers have a ReLU activation function by default, 
for completeness we're listing it here explicitly. 


We can make this model in Keras by adding just one line to our mod- 
el-making routine, creating the second dense hidden layer. This 
line looks just like the one above it except that we don’t include the 
input_shape argument, since that’s only used by the first layer in the 
model. Listing 24.1 shows the code. 


def make_two_hidden_layers_model(): 
    model = Sequential() 
    model.add(Dense(number_of_pixels, 
                    input_shape=[number_of_pixels], 
                    activation='relu')) 
    # add the second hidden layer, then the output layer 
    model.add(Dense(number_of_pixels, activation='relu')) 
    model.add(Dense(number_of_classes, activation='softmax')) 
    model.compile(loss='categorical_crossentropy', 
                  optimizer='adam', 
                  metrics=['accuracy']) 
    return model 


Listing 24.1: Create and compile the same network as in Chapter 23, but 
with two identical Dense layers in a row. 


Figure 24.7 shows the accuracy of the training and validation sets over 
100 epochs of learning (we did not use early stopping). 


Figure 24.7: The accuracy and loss of our network with two hidden 
layers, plotted against the number of epochs. 


Compared to our previous results, the new curves are wigglier in the 
starting epochs, suggesting that the bigger network took more time to 
settle down. And just as we suspected, the system overfit the train- 
ing data just like before, but it did so even more quickly, driving up 
the validation loss faster than before. Back in Chapter 23, we saw that 
after an initial drop, the validation loss climbed from about 0.06 to 
about 0.09 over 100 epochs. Now the validation loss is over 0.10, and 
it’s rising. The system is overfitting faster, as expected. 


So just throwing more neurons at the problem did not make everything 
better. Validation accuracy improved a touch, but the loss is looking 
much worse, and we're still overfitting considerably. 


24.2.5 Less Is More 


Having too many neurons has made our network too capable. It had 
more than enough power for this task, so it used its extra abilities to 
extract more and more idiosyncratic detail from the training set, and 
thus overfit. 


We generally want the smallest, simplest network that will get us the 
results we’re after. A simpler network not only trains and predicts 
faster, but it’s less prone to overfitting because there’s less superflu- 
ous computational power to get distracted by irrelevant details in the 
training data. 


Let’s go back to our single dense layer, but make it far smaller, with 
only 64 neurons. This gives us roughly one neuron for every 12 input 
pixels. The new architecture is shown in Figure 24.8. 


[Diagram: 64 neurons (ReLU), then 10 neurons (softmax)] 


Figure 24.8: A new two-layer network where we'll only use 64 neurons in 
the first, fully-connected hidden layer. 


We'll just change the line that defines this layer to give it 64 neurons 
rather than 784. Listing 24.2 shows the change. 


def make_smaller_one_hidden_layer_model(): 
    model = Sequential() 
    model.add(Dense(64, input_shape=[number_of_pixels], 
                    activation='relu')) 
    model.add(Dense(number_of_classes, activation='softmax')) 
    model.compile(loss='categorical_crossentropy', 
                  optimizer='adam', 
                  metrics=['accuracy']) 
    return model 


Listing 24.2: Building a model where the first (and only) hidden layer has 
just 64 neurons. 


The accuracy and loss results for 100 epochs are shown in Figure 24.9. 


[Plots: "64-neuron hidden layer, Accuracy" and "64-neuron hidden layer, Loss", each plotted against epochs]


Figure 24.9: The accuracy and loss of our model of Figure 24.8.


Our network is giving us a bit less accuracy than the one with 784 neu- 
rons in the first layer, but even so, it’s still overfitting. What to do? 
Let’s try using our idea from the last section and use two hidden layers 
instead of one, but we'll keep the same number of neurons and split 
them evenly. In other words, we'll have two hidden layers of 32 neu- 
rons each, as shown in Figure 24.10. 


[Diagram: two 32-neuron dense layers with ReLU, followed by a 10-neuron dense layer with softmax]


Figure 24.10: A deeper model with three layers. We're splitting up the 
64-neuron hidden layer of our previous model into two separate, fully-con- 
nected 32-neuron hidden layers. 


The accuracy and loss results for 100 epochs of training are in Figure 
24.11. 


Figure 24.11: The accuracy and loss of the model in Figure 24.10.


The validation accuracy after 100 epochs has decreased a little bit 
from about 97.5% for our single 64-neuron layer to about 97% for our 
new, two-layer network. The loss has increased, too, and we seem to 
be overfitting even more rapidly than before. 


We've taken a step backwards. A small step, granted, but the measure- 
ments are worse and we're still overfitting. 


Though chopping up layers into multiple, smaller pieces can some- 
times work, it didn’t help us much in this example. 


This is often the way it goes: we try one thing and another, following 
up on ideas that work and setting aside those that don’t make things 
better. 


Before we give up on these two small layers, let’s see if we can try 
another trick to get the overfitting under control. 


24.2.6 Adding Dropout 


Since overfitting is a problem for this model and data, let’s try using 
dropout. As we discussed in Chapter 20, this is a regularization 
technique explicitly designed to address overfitting. Dropout temporarily 
removes a random selection of neurons before each batch, and puts 
them back in afterwards. The intuition is that our neurons will be less 
likely to specialize (and potentially over-specialize), since they all need 
to be able to compensate for randomly missing neurons. 


To apply dropout in Keras, we create a new dropout layer and add it 
to the growing stack, just after the layer we want to remove some nodes 
from. When dropout is applied, randomly-chosen neurons are isolated 
from the network for one batch, so they don’t contribute to 
predictions, and they don’t learn when the network’s weights are updated. 
When the batch is done, the neurons are restored, and before the next 
batch, a new random collection gets disconnected. 
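
The idea can be sketched in a few lines of plain Python. This is a simplified illustration, not Keras's implementation: the function name apply_dropout() and its arguments are invented for this example. It zeroes each output with probability rate while training, and scales the survivors by 1 / (1 - rate), a common convention often called inverted dropout, so that the expected total stays about the same. When training is False the values pass through untouched, which is how a network behaves when it's being evaluated or used for prediction.

```python
import random


def apply_dropout(outputs, rate, training=True, rng=random):
    """Return a copy of outputs with some values temporarily zeroed.

    Each value is dropped with probability `rate`. Survivors are scaled
    by 1 / (1 - rate) so the expected sum of the outputs is unchanged.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be at least 0 and less than 1")
    if not training or rate == 0.0:
        return list(outputs)
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else value * keep_scale
            for value in outputs]


rng = random.Random(42)
layer_outputs = [0.5, 1.2, 0.0, 0.8, 2.0, 0.3, 1.1, 0.9]
print(apply_dropout(layer_outputs, rate=0.2, rng=rng))
print(apply_dropout(layer_outputs, rate=0.2, training=False))
```

Calling the function again with the same rate picks a different random set of values to silence, which is what keeps any one neuron from being relied upon too heavily.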


It might seem a little weird that dropout is included as a layer. It 
doesn’t have any neurons, and it doesn’t participate in backprop or 
computation, so how can it be a layer? Calling this a layer is really 
just a conceptual device. We’d like to apply dropout to not just Dense 
layers, but other types of layers, like the convolution and recurrent 
layers we'll cover later in this chapter. Rather than build dropout into 
each layer, Keras lets us specify this kind of “informational” layer that 
doesn’t do any computing, but tells Keras about something we want 
it to do. Thinking of operations like dropout as implemented by their 
own layers lets us keep our conceptual view of our model simple and 
clean. We just have a big stack of layers. Some layers have neurons, 
and others perform operations on other layers or on data. 


In this case, the dropout layer says to Keras, “apply dropout to the pre- 
ceding layer.” If there are, say, 3 Dense layers preceding this dropout 
layer, only the most recent one is affected. If we wanted to apply drop- 
out to all three Dense layers, we’d have to follow each one individually 
with its own dropout layer. 


The dropout layer in Keras has one mandatory parameter. It’s a 
floating-point number between 0 and 1 that gives the fraction of the 
preceding layer’s neurons that will be temporarily removed for each 
batch. A value of 0 disables dropout, while a value of 1 would make the 
preceding layer effectively disappear. The authors of the original paper 
on dropout advise a value of 0.2, and that’s generally a good place to 
start [Srivastava14]. 


The authors also advise constraining the total size of the weights on 
the dense layers that are affected by dropout. Speaking generally, the 
concern is that when some nodes are removed, the others might over- 
compensate by cranking their weights up very high. Without getting 
into the math, we can take their advice by setting an optional parameter 
on the Dense layer that’s going to experience dropout. The parame- 
ter is called kernel_constraint, and the advice of the authors of the 
paper cited above is to set that to the value 3, so we'll do just that. We 
only need to add this option to Dense layers that will have dropout 
applied to them, as we'll see in a code listing just below. 
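
As a rough picture of what such a constraint does, the sketch below clamps the length of one neuron's incoming weight vector. It assumes the constraint acts on the Euclidean (L2) norm of those weights, which is how the max-norm idea is described in the dropout paper. The function clamp_weight_norm() is a hypothetical helper written for this illustration, not a Keras call.

```python
import math


def clamp_weight_norm(weights, max_norm=3.0):
    """Rescale a neuron's incoming weights if their L2 norm is too big.

    Weights whose norm is already at or below max_norm are returned
    unchanged; otherwise they are shrunk so their norm equals max_norm.
    """
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= max_norm:
        return list(weights)
    scale = max_norm / norm
    return [w * scale for w in weights]


small = clamp_weight_norm([1.0, 2.0, 2.0])   # norm is 3, left alone
large = clamp_weight_norm([3.0, 4.0])        # norm is 5, shrunk to 3
print(small, large)
```

The direction of the weight vector is preserved; only its overall size is capped, which is why it's fair to think of this as a lid on how big the weights can grow.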


The complete model specification for our two-layer model, with drop- 
out, is shown in Figure 24.12. Here we're applying dropout to both of 
our two hidden layers. 


[Diagram: Dense 32 (ReLU), dropout 0.2, Dense 32 (ReLU), dropout 0.2, Dense 10 (softmax)]


Figure 24.12: We'll change our model in Figure 24.10 to add dropout layers 
after each 32-neuron, fully-connected, hidden layer. Here we're using our 
symbol for dropout: a diagonal slash through the line connecting two 
layers. Each dropout layer applies to the layer preceding it. 


In Figure 24.12 we show our schematic symbol for a dropout layer, 
which is a slanted line crossing the line carrying data, suggesting that 
some of the data is being struck out, or removed. 


The code for making this model is in Listing 24.3. There are a few new 
things happening in this code. 


from keras.layers import Dropout
from keras.constraints import maxnorm

def two_layers_with_dropout_model():
    model = Sequential()
    model.add(Dense(32, input_shape=[number_of_pixels],
                    activation='relu',
                    kernel_constraint=maxnorm(3)))
    model.add(Dropout(0.2))
    model.add(Dense(32,
                    activation='relu',
                    kernel_constraint=maxnorm(3)))
    model.add(Dropout(0.2))
    model.add(Dense(number_of_classes, activation='softmax'))
    # compile the model to turn it from specification to code
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model


model = two_layers_with_dropout_model()


Listing 24.3: Two dense layers of 32 neurons each are both followed 
by Dropout layers. We use a dropout percentage of 0.2. We also set 
kernel_constraint in the Dense layers. 


First, of course, we’re adding dropout layers. The argument 0.2 tells 
the layer to use the 20% dropout rate suggested by dropout’s creators. 


As we mentioned above, the original paper on dropout also suggested 
imposing a technical condition on the weights in the layer that’s 
experiencing the dropout, and that advice is widely followed. In Listing 
24.3 we do this by adding the optional argument kernel_constraint 
to the argument list for each layer that will be affected by dropout, and 
setting that parameter’s value to maxnorm(3) (note that we have to 
import maxnorm() in order to use it). The reasoning behind this step, 
and what maxnorm() is doing, is explained in the 
original paper [Srivastava14]. It’s reasonable to just think of it as a 
mechanism to keep the weight values from getting too big. 


Training this model for 100 epochs produces the results in Figure 
24.13. 


Figure 24.13: Accuracy and loss for the model of Figure 24.12. 


We've conquered our overfitting problem! The losses are no longer 
diverging. Dropout has done a great job for us. 


The accuracy is a bit weird, since we’re getting better accuracy on the 
validation data than the training data. That’s less strange than it 
looks: dropout is active while the training accuracy is measured, but 
it’s switched off when the network is evaluated on the validation data. 
The validation accuracy seems 
to have taken a small hit, too, since it’s not up to the 98.3% from before. 
We might be able to tweak our accuracy upwards by reducing the 
dropout rate a bit, or adding a few more neurons to our dense layers. 


The dropout paper also recommends that we configure our optimizer 
to use a learning rate 10-100 times larger than we normally 
would. We can tell Adam to start with any specific learning rate by 
setting its optional argument lr (that’s a lower-case letter L followed by 
a lower-case letter R, standing for “learning rate”). This value defaults 
to 0.001. 


To pass this argument to Adam, we have to make an Adam object as we 
did earlier, and pass our new value to its learning rate parameter. 
Listing 24.4 shows how we'd set the initial learning rate to 0.1. This 
would replace the line previously calling model.compile(). 


from keras.optimizers import Adam

# make our own Adam object
adam_optimizer = Adam(lr=0.1)

# optimizer gets our object, rather than a string
model.compile(loss='categorical_crossentropy',
              optimizer=adam_optimizer, metrics=['accuracy'])


Listing 24.4: We can provide our own optimizer object when we compile, 
rather than rely on a default. Here we make an Adam with our own choice 
of learning rate. 


A shorter way to write this is shown in Listing 24.5, where we create 
the Adam object and assign it, without needing a temporary variable to 
hold it. 


model.compile(loss='categorical_crossentropy',
              optimizer=Adam(lr=0.1),
              metrics=['accuracy'])


Listing 24.5: We don't need to store our new Adam object in its own vari- 
able. This shorter approach is more common. 


The surprising results are shown in Figure 24.14. 


Figure 24.14: The accuracy and loss for our model with dropout when we 
set Adam's initial learning rate to 0.1. 


Wow. These graphs are as bad as they look. 


For this data and architecture, starting Adam with a learning rate of 0.1 
was much too aggressive. The training accuracy plummeted to about 
0.18, which is terrible. The validation accuracy seems to be fluttering 
around 0.2, but it’s very noisy. The loss was also terrible, more 
than 10 times worse than before. 
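
A toy example shows why a step size that's too large can wreck training. The sketch below is not Adam and not our network: it runs plain gradient descent on a single-variable bowl, f(x) = 12x², whose steepness was picked for this illustration so that learning rates of 0.1, 0.01, and 0.001 behave differently. The function name descend() is ours.

```python
def descend(learning_rate, steps=50, start=1.0):
    """Run gradient descent on f(x) = 12 * x**2 and return the final x."""
    x = start
    for _ in range(steps):
        gradient = 24.0 * x          # derivative of 12 * x**2
        x -= learning_rate * gradient
    return x


for rate in (0.1, 0.01, 0.001):
    print(rate, abs(descend(rate)))
```

With the largest rate, every step overshoots the bottom of the bowl and lands farther away than it started, so the distance from the minimum grows without bound. The middle rate settles quickly, and the smallest is still creeping toward the minimum after 50 steps. Real optimizers and real loss surfaces are far more complicated, but the same tension applies.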


If we drop the learning rate down to 0.01, we get much more encour- 
aging performance, as shown in Figure 24.15. 


Figure 24.15: Accuracy and loss for our dropout model with Adam’s initial 
learning rate set to 0.01. 


These results aren’t nearly as good as what we got with the default 
learning rate of 0.001, but it was worth a shot. Things are much calmer, 
and our accuracies are both above 90%. And we're still not overfitting. 


We could try out a variety of learning rates to try to see what value 
works the best for this model and data, but that would be a lot of typ- 
ing and waiting. 


It would be really nice if we could automate this search, so the com- 
puter could try out a variety of learning rates for us while we do other 
things. In an upcoming section, we'll see how to do just this using tools 
from scikit-learn. 


24.2.7 Observations 


We've only begun the process of tuning and refining our model. 
Tinkering with a practice model like this one in search of the best results 
is time well spent, since it hones our intuition and can help guide our 
choices with other, larger datasets and models in the future. 


This is one of those times when saying “the exercise is left to the reader” 
is completely appropriate. There’s no substitute for sitting down 
with a deep learning model and adjusting its structure and 
hyperparameters to get a feeling for how that model behaves on that data. 


When we work with models that we will actually be deploying in the 
real world, we want the best-performing models we can develop. When 
models become so big that they can take days (or even weeks) to train, 
it’s important to have a good sense of what’s likely to work, since it lets 
us start much closer to the finish line than just assembling a model at 
random. Even if we automate the parameter search, we still want to 
focus our search where it’s going to pay off the most. 


We found that for this data a single giant fully-connected hidden layer 
was overkill. It started overfitting almost immediately, and the 
training accuracy went flat. Worse, the validation loss was steadily climbing, 
which would eventually cause the validation accuracy to drop. We were 
throwing a network with too many computing resources at the problem, 
and it used those resources to overfit, learning far too much about 
the idiosyncrasies of the training data even after it had perfect accuracy. 


By significantly reducing the size of that layer we got a smaller, 
simpler network, but our accuracy dropped a little and we were still 
overfitting. 


Splitting that single layer into two pieces didn't help by itself, but 
once we added dropout, we stopped overfitting. 


There are many other things left to try. Using more layers is always 
something to consider. Perhaps each layer could be smaller than the 
one before it, forcing the system to look for larger patterns. Or per- 
haps we could have one small “choke” layer between two larger ones. 
We might try applying dropout only on some of the layers, or apply- 
ing it more aggressively (that is, raising the number of neurons we’re 
suppressing). 


24.3 Using Scikit-Learn 


This section’s notebook is 
Keras-Notebook-10-scikit-learn.ipynb 


So far we’ve been searching our hyperparameters by hand. It’s been 
illuminating, but it also required a lot of manual effort. 


We saw in Chapter 15 that the scikit-learn library offers us routines to 
cross-validate our model (to estimate how good it is), and grid-search 
its hyperparameters (to find the best-performing combination). 


Keras doesn’t offer either of these tools directly, because it offers a way 
to use the ones already in scikit-learn. 


Let’s pause a moment to think about what we might be asking for from 
these tools. If a model takes 3 hours to train, then running 
cross-validation with 10 folds will take about 10 times longer, or 30 hours. If 
we’re grid-searching over, say, three hyperparameters with 5 values 
each (which is not a very large search), then it will take 125 times 
longer than that, or more than five months! 
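
The arithmetic behind that estimate is short enough to write out. The variable names below are ours, and the numbers are the ones from the example above.

```python
hours_per_training_run = 3
folds = 10
values_per_hyperparameter = 5
hyperparameters = 3

# Every combination of hyperparameter values is one point on the grid.
grid_points = values_per_hyperparameter ** hyperparameters

# Each grid point needs a full cross-validation, one training run per fold.
total_hours = hours_per_training_run * folds * grid_points
total_days = total_hours / 24

print(grid_points, total_hours, total_days)
```

The cost multiplies with every hyperparameter we add, so even one more with 5 values would stretch the search to over two years.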


Is there any way to cut this down? 


A popular approach is to extract a tiny piece of the data set, carefully 
selected to be representative of the whole, and search on that. Then 
each training run will be much faster. 


By cross-validating and grid-searching one or more of these little proxy 
datasets, we can get some guidance for what models and 
hyperparameters are worth exploring on a larger scale. Then we can take that 
knowledge and work with larger and larger pieces of the dataset, 
tuning the hyperparameters at each step. The hope is that by the time we 
reach the full dataset, we'll have a great set of hyperparameters to 
train on and we'll need only a little searching, or perhaps even none at 
all. 
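
One simple way to pick a representative piece is to sample each class separately, so the small set keeps the same mix of labels as the full one. The sketch below does this in plain Python. The helper stratified_subsample() is hypothetical, written only for this example; scikit-learn has its own tools for stratified splitting, which we'd normally prefer in practice.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, fraction, seed=42):
    """Return sorted indices of a subset that preserves class proportions."""
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    chosen = []
    for indices in indices_by_class.values():
        # Keep at least one sample so no class vanishes from the subset.
        count = max(1, round(len(indices) * fraction))
        chosen.extend(rng.sample(indices, count))
    return sorted(chosen)


labels = [0] * 600 + [1] * 300 + [2] * 100
subset = stratified_subsample(labels, fraction=0.1)
print(len(subset))
```

Matching the label proportions is only one notion of "representative." It says nothing about whether the chosen samples cover the variety within each class, so it's worth inspecting the subset before trusting it.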


24.3.1 Keras Wrappers 


It would be nice to use scikit-learn’s cross-validation and grid-search 
tools directly on our Keras models. But the two are separate libraries, 
and scikit-learn doesn’t know anything about Keras and its models. 
Happily, Keras does know about scikit-learn. 


In particular, Keras knows how scikit-learn expects its estimators to 
behave. With that, Keras can dress up one of its models to act like a 
scikit-learn estimator, bridging the gap. 


This act of camouflage lets us place a Keras model into scikit-learn, 
and then do cross-validation, grid search, or any other operation we 
like. From scikit-learn’s perspective, this object is just some custom 
estimator that we wrote and gave to it. It doesn’t know that there’s a 
deep network hiding inside. 


We pull off this trick by embedding our Keras model in an object of 
type KerasClassifier or KerasRegressor, depending on the job it 
does. These objects are called wrappers, since they “wrap” our Keras 
model in a disguise that makes it look and act like a scikit-learn esti- 
mator. We don’t have to modify our network in any way to wrap it. We 
just make a wrapper object, place our network inside, and we're done. 


Since both wrappers work identically, we'll choose KerasClassifier 
as an example so we can stick with the MNIST classifiers we’ve been 
discussing so far. 


We don’t actually hand our model to the wrapper function. Instead, 
we hand it the name of a function that builds the model and returns it. 
This makes sense when we think about it. The searching process, for 
instance, may create many versions of our model with different hyper- 
parameters. If we gave it a built and compiled model, there wouldn’t 
be any way for it to make different versions. By giving it a function that 
builds the model, the searching program can call the function, and 
intervene to set various parameters as it desires. 


The other arguments to the wrapper creator are arguments that get 
passed on. Some are given to the model-making function, and others 
get passed to scikit-learn. 


Let’s dig in, starting with the model-making function. 


This argument is named build_fn, short for “build function.” Its value 
is a function that we’ve written which will construct, compile, and 
return a Keras model, just like we’ve been seeing in listings like Listing 
24.3, where the function two_layers_with_dropout_model() made a 
model, compiled it, and returned it. 


There are some advanced options, but usually we assign 
this argument the name of a function in our code, such as 
two_layers_with_dropout_model. Note that we leave off the 
parentheses, since we’re not calling the function, but only providing its 
name. Often we’d like to parameterize this function with arguments. 
For instance, we might be searching for the best number of neurons to 
use in the first two layers. So we make those numbers arguments that 
we use when we make the model. 


As we said before, this model-making function will be called 
automatically by scikit-learn when the model is required. When we're grid 
searching, the model will usually be built over and over again at the 
start of each new step of the search. 


So we have our model-making function that takes arguments, and a 
wrapper, and scikit-learn which is going to call our function. How do 
we get scikit-learn to include the arguments we want when it calls the 
model-making function? 


Happily, the mechanism is easy. The trick is in the naming of our 
arguments. Recall from Chapter 15 that when we create a search using 
scikit-learn, we provide it with a dictionary that names each parame- 
ter we want it to search on as a key, with values to be tried as the value. 
Python could then match up those dictionary names with the names of 
parameters in the functions it called. 


In the case of a wrapper, things are even easier. Thanks to Python’s 
ability to “know” what the parameter names are in functions, we don’t 
even need the dictionary. We can just name the parameters we want to 
assign values to, along with the values we want them to have. 


For instance, if our model-making function takes a parameter to control 
the number of neurons it makes, perhaps called number_of_neurons, 
then we can place that into the wrapper’s argument list with a value, 
just as if we were assigning a value to it when calling the function. Any 
arguments in our model-making function whose names match argu- 
ments in the wrapper-making step will be given the assigned values. 
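
Here's a small, library-free sketch of that name-matching idea. It is not the code inside Keras: TinyWrapper and make_settings() are stand-ins invented for this illustration. The wrapper stores whatever keyword arguments it's given. When it's time to build, it uses Python's inspect module to hand the builder only those arguments whose names appear in the builder's parameter list.

```python
import inspect


def make_settings(number_of_neurons=32, dropout_ratio=0.2):
    """Stand-in for a model-making function."""
    return {"number_of_neurons": number_of_neurons,
            "dropout_ratio": dropout_ratio}


class TinyWrapper:
    """Remembers keyword arguments and routes them by name."""

    def __init__(self, build_fn, **params):
        self.build_fn = build_fn
        self.params = params

    def build(self, **overrides):
        # Values given now take priority over the remembered defaults.
        merged = {**self.params, **overrides}
        accepted = inspect.signature(self.build_fn).parameters
        builder_args = {name: value for name, value in merged.items()
                        if name in accepted}
        return self.build_fn(**builder_args)


wrapper = TinyWrapper(make_settings, number_of_neurons=64, epochs=100)
print(wrapper.build())
print(wrapper.build(number_of_neurons=16))
```

Notice that epochs never reaches make_settings(), because that function has no parameter of that name. In this toy version it's simply set aside. Because the wrapper holds a function rather than a finished object, it can also build a fresh result as many times as it likes, each time with different values.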


To see this in action, let’s start with a model-making function that 
takes parameters. Listing 24.6 shows an example, building on our pre- 
vious listings that imported and processed the MNIST data. 


def make_model(number_of_layers=2, neurons_per_layer=32,
               dropout_ratio=0.2, optimizer='adam'):
    model = Sequential()

    # first layer is special, because it sets input_shape
    model.add(Dense(neurons_per_layer,
                    input_shape=[number_of_pixels],
                    activation='relu', kernel_constraint=maxnorm(3)))
    model.add(Dropout(dropout_ratio))

    # now add in all the rest of the dense-dropout layers
    for i in range(number_of_layers-1):
        model.add(Dense(neurons_per_layer,
                        activation='relu',
                        kernel_constraint=maxnorm(3)))
        model.add(Dropout(dropout_ratio))

    # finish up with a softmax layer with 10 outputs
    model.add(Dense(number_of_classes, activation='softmax'))

    # compile the model and return it
    model.compile(loss='categorical_crossentropy',
                  optimizer=optimizer, metrics=['accuracy'])
    return model


Listing 24.6: Our model-making function make_model() takes four 
optional parameters. Two are integers, one is a float, and one is a string. 
It creates as many dense layers as we request, appending a dropout layer 
after each one. 


Listing 24.7 shows how to pass parameters to our new make_model() 
function. There are ways to make this code 
smaller (such as by using Python’s **kwargs technique), but as usual, 
we've chosen clarity over conciseness. 


from keras.wrappers.scikit_learn import KerasClassifier

kc_model = KerasClassifier(build_fn=make_model,
    # parameters for the model-making function
    number_of_layers=2, neurons_per_layer=32,
    optimizer='adam',
    # parameters for scikit-learn
    epochs=100, batch_size=256, verbose=0)


Listing 24.7: We wrap up make_model() in a KerasClassifier along 
with defaults for all the variables we need to create the model and control 
scikit-learn’s training process. When scikit-learn calls make_model(), it 
will assign the parameters of that function the values we provided when 
we made our KerasClassifier. 


In effect, the wrapper just takes the values we provide to it and passes 
them to the model-making function arguments of the same name. The 
syntax is a little confusing because it looks like KerasClassifier is 
taking these arguments for itself. In fact it only holds onto them, and 
later hands each one to the function that has a parameter of that name. 


It’s a common convention to assign the wrapped-up object to a vari- 
able called model, but that can be confused with the more typical use 
of model to mean a normal, or unwrapped, neural network. To empha- 
size that this is not a straightforward Keras model, we're calling it 
kc_model for “Keras classifier model.” This way we can speak of the 
wrapped object (the kc_model) and the model that gets created using 
its build_fn function, which we'll continue to simply call a “model.” 


If we hand kc_model to scikit-learn for cross-validation, make_model() 
will be called and passed the value 2 for number_of_layers, the value 
32 for the argument neurons_per_layer, and the string 'adam' for 
optimizer, just as though we’d assigned them ourselves. Since we’re 
not giving a value for dropout_ratio to KerasClassifier, it doesn’t 
assign any value to that parameter, so make_model() will use its default 
value for that argument. 


If we use this kc_model for grid searching, the searcher can assign its 
own values to any of these parameters when it calls the function to 
make the model. Any arguments we don’t explicitly re-assign will use 
the default values specified when we make the wrapper. 


The last set of three arguments to KerasClassifier() (epochs, 
batch_size, and verbose) are not for our model-making function. They 
get passed along when scikit-learn calls fit() on the wrapper, where 
they control the training process. 


In addition to giving parameters for fit(), we can also name param- 
eters that get passed to predict(), predict_proba(), and score(), 
for use if and when those functions get called. 


As long as we keep all of our parameter names distinct, Python will 
correctly pass the desired value to every function involved in the 
cross-validation and grid searching. 


This is a flexible system. For example, it means that we could tune the 
value of batch_size when searching a grid, or the size of our network 
by trying different values of neurons_per_layer, or different 
optimizers by trying different strings for optimizer. 
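
To get a feel for what a grid search would try, we can expand a dictionary of candidate values into every combination. The sketch below uses only Python's itertools. The candidate values are arbitrary placeholders chosen for this example, and the keys reuse the parameter names from Listing 24.6 along with batch_size.

```python
from itertools import product

candidates = {
    "batch_size": [64, 256],
    "neurons_per_layer": [16, 32, 64],
    "optimizer": ["adam", "sgd"],
}

# Build one dictionary of settings for every point on the grid.
names = list(candidates)
grid = [dict(zip(names, values))
        for values in product(*(candidates[name] for name in names))]

print(len(grid))
print(grid[0])
```

Even this modest grid has a dozen points, and each one means building and training a fresh model.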


The key thing to remember is that the wrapper is basically remembering 
what values should be used for the arguments in the model-making 
function, and it will use those by default. It also remembers a few values 
that get passed on to scikit-learn. As long as the names we're assigning 
to in the wrapper match the names in the model-making routine, 
everything will be automatically matched up. 


24.3.2 Cross-Validation 


Let’s use a Keras wrapper to do cross-validation on our recent three- 
layer deep learning system of Figure 24.12 and Listing 24.3, with two 
32-neuron dense layers (each with dropout) and a 10-neuron dense 
output layer. 


Applying cross-validation may seem pointless. After all, we already 
have an excellent, large testing set. What more are we going to learn 
from cross-validation that we haven’t already seen by using our valida- 
tion data? 


In this case, not much. We should expect the results of cross-valida- 
tion to be very close to what we saw above. 


But having our own high-quality validation set coming along with the 
data is a luxury we can’t always count on. Sometimes there is no 
validation set. Sometimes we have one but we’re not sure it’s very good. 
For instance, consider a new deck of cards. Ignoring any jokers or 
other cards, one typical arrangement for the new cards is to run from 
ace of hearts to king of hearts, then ace to king in the suit of clubs, then 
again for diamonds and then spades. This is called “New Deck Order” 
[Cain13]. Suppose someone opens up a new deck and takes away the 
bottom 25%, calling it the validation data. This set is definitely not 
representative of the rest of the deck, because it contains only spades 
(so no red cards of any suit), and the remaining cards have no spades. 
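
We can act out the card example in a few lines of Python. The deck below follows the order just described (hearts, clubs, diamonds, spades, each running from ace to king). The variable names are ours.

```python
import random

suits = ["hearts", "clubs", "diamonds", "spades"]
ranks = ["ace", "2", "3", "4", "5", "6", "7", "8", "9", "10",
         "jack", "queen", "king"]
deck = [(rank, suit) for suit in suits for rank in ranks]

# Naive split: the bottom quarter of the unshuffled deck.
cut = len(deck) * 3 // 4
naive_validation = deck[cut:]
print({suit for _, suit in naive_validation})   # only one suit

# Shuffling first makes a one-suit validation set extremely unlikely.
shuffled = deck[:]
random.Random(42).shuffle(shuffled)
shuffled_validation = shuffled[cut:]
print({suit for _, suit in shuffled_validation})
```

The lesson carries over to real data: if the samples arrive in some meaningful order, slicing off the end without shuffling can produce a validation set that looks nothing like the training set.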




Another challenge of validation sets comes when we're working with 
a small version of an original dataset. If this dataset is small, as we 
saw in Chapter 8, cross-validation is a great way to evaluate it without 
making the training set even smaller by making a dedicated validation 
set. 


So although the MNIST data gifts us with a great validation set, we'll 
proceed as though that isn’t the case, so we can see how to approach 
the problem of evaluating the quality of a trained model in the general 
case. 


We'll first do something simple but incomplete, just to get a feeling for 
the process. Then we'll add in the missing step. 


Cross-validation requires training and then validating our entire 
model over and over again with slightly different data. We'll be using 
10 folds, so each session of the cross-validator will take 10 times longer 
than the training sessions earlier in this chapter. 


Let’s get going, since there’s almost nothing to it. We'll just make our 
model and then run cross-validation with scikit-learn as in Chapter 15. 


As we mentioned above, we’ll repeat our model of Listing 24.7 and use 
2 layers of 32 neurons, each with dropout. We’ll stick with the default 
Adam optimizer, though we could make our own Adam object as before 
and use that here instead. 


To make our model, we'll supply our generalized model-making rou- 
tine make_model() in Listing 24.6 with the parameters that make our 
desired network. 


kc_model = KerasClassifier(build_fn=make_model,
                           number_of_layers=2,
                           neurons_per_layer=32,
                           optimizer='adam',
                           epochs=100, batch_size=256, verbose=0)


Listing 24.8: Placing our model inside of a Keras wrapper. 


In Listing 24.8 we built a Keras wrapper of type KerasClassifier 
with our new routine make_model(). We gave three arguments 
(number_of_layers, neurons_per_layer, and optimizer) that matched the 
arguments in make_model(), so they will get passed in when the model is 
built. We also set the parameters that we want to give to fit() when 
it gets called (epochs, batch_size, and verbose). Now 
kc_model can be used inside of scikit-learn like any other estimator. 


Before we actually do that, there are a couple of loose ends to clean up. 


One issue is that scikit-learn’s cross-validation function 
cross_val_score() doesn’t want the one-hot encoded version of our 
label data. It wants the original versions that contain lists of integers. 
It so happens that we’ve been saving the original labels all this time in 
their own variables, so we have just what the routine needs. What a 
coincidence! 
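
If we hadn't kept the original labels, they'd be easy to recover, since each one-hot row has a single 1 whose position is the class number. A minimal sketch in plain Python follows; with NumPy arrays the usual tool is np.argmax() along the class axis. The function name one_hot_to_integers() is ours.

```python
def one_hot_to_integers(one_hot_rows):
    """Convert one-hot encoded rows back to integer class labels."""
    return [row.index(max(row)) for row in one_hot_rows]


encoded = [[0, 0, 1, 0],
           [1, 0, 0, 0],
           [0, 0, 0, 1]]
print(one_hot_to_integers(encoded))   # [2, 0, 3]
```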


The other issue has to do with what data we pass to the cross-valida- 
tion system. As we’ve done before, we'll simply pretend that we don’t 
have a validation set, and treat the training data as if it was our entire 
dataset. We'll let the cross-validator manage the train-validation split 
for us. 


Now let’s get this cross-validation going. There are two tasks to per- 
form. First, we'll make the object that drives the cross-validation 
process. Let’s use our old friend StratifiedKFold() from Chapter 
15, with 10 splits. We'll shuffle the data, and we'll set the optional 
random_state variable to the value of random_seed that we already 
have around. That’s useful for debugging. 


We can use Listing 24.9 to make the StratifiedKFold object. 


from sklearn.model_selection import StratifiedKFold


kfold = StratifiedKFold(n_splits=10, shuffle=True,
                        random_state=random_seed)


Listing 24.9: Creating the StratifiedKFold object that will build the 
cross-validation training and test sets. 
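
To make "stratified" concrete, here is a rough, library-free sketch of the idea: deal the samples of each class out to the folds in turn, so every fold gets nearly the same share of every class. This is only an illustration of the principle, not how scikit-learn implements StratifiedKFold, and stratified_folds() is a hypothetical helper.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits, seed=42):
    """Assign every sample index to one of n_splits folds, class by class."""
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    folds = [[] for _ in range(n_splits)]
    for indices in indices_by_class.values():
        rng.shuffle(indices)
        # Deal this class's samples out to the folds like cards.
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds


labels = [0] * 50 + [1] * 30 + [2] * 20
folds = stratified_folds(labels, n_splits=10)
for fold in folds[:2]:
    print(sorted(labels[i] for i in fold))
```

Each fold then takes a turn as the validation set while the other nine are used for training.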


Now we're ready to go. Using the same techniques that we saw in 
Chapter 15, we just tell scikit-learn to run the cross-validator and track 
the scores by calling cross_val_score() with our model, our training 
data and original labels, and our folding object. Listing 24.10 shows 
the code. 


from sklearn.model_selection import cross_val_score 


results = cross_val_score(kc_model, X_train, original_y_train,
                          cv=kfold, verbose=0)


Listing 24.10: Running cross-validation from scikit-learn, using 
kc_model, our Keras model in a wrapper. 


Putting it all together, we get Listing 24.11. 


from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.constraints import maxnorm
from keras import backend as keras_backend
from keras.utils import np_utils
from keras.models import load_model
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
import numpy as np

random_seed = 42
np.random.seed(random_seed)

# load MNIST data and save sizes
(X_train, y_train), (X_test, y_test) = mnist.load_data()
image_height = X_train.shape[1]
image_width = X_train.shape[2]
number_of_pixels = image_height * image_width

# convert to floating-point
X_train = keras_backend.cast_to_floatx(X_train)
X_test = keras_backend.cast_to_floatx(X_test)

# scale data to range [0, 1]
X_train /= 255.0
X_test /= 255.0

# save y_train and y_test for use when cross-validating
original_y_train = y_train
original_y_test = y_test

# replace label data with one-hot encoded versions
number_of_classes = 1 + max(np.append(y_train, y_test))
y_train = np_utils.to_categorical(y_train, num_classes=number_of_classes)
y_test = np_utils.to_categorical(y_test, num_classes=number_of_classes)

# reshape samples to 2D grid, one line per image
X_train = X_train.reshape(X_train.shape[0], number_of_pixels)
X_test = X_test.reshape(X_test.shape[0], number_of_pixels)

def make_model(number_of_layers=2, neurons_per_layer=32,
               dropout_ratio=0.2, optimizer='adam'):
    model = Sequential()
    # first layer is special, because it sets input_shape
    model.add(Dense(neurons_per_layer,
                    input_shape=[number_of_pixels],
                    activation='relu',
                    kernel_constraint=maxnorm(3)))
    model.add(Dropout(dropout_ratio))
    # now add in all the rest of the dense-dropout layers
    for i in range(number_of_layers-1):
        model.add(Dense(neurons_per_layer, activation='relu',
                        kernel_constraint=maxnorm(3)))
        model.add(Dropout(dropout_ratio))
    # finish up with a softmax layer with 10 outputs
    model.add(Dense(number_of_classes,
                    kernel_initializer='normal',
                    activation='softmax'))
    # compile the model and return it
    model.compile(loss='categorical_crossentropy',
                  optimizer=optimizer, metrics=['accuracy'])
    return model

# make the model and wrap it up for scikit-learn
kc_model = KerasClassifier(build_fn=make_model,
                           number_of_layers=2, neurons_per_layer=32,
                           optimizer='adam',
                           epochs=100, batch_size=256, verbose=0)

# create cross-validator
kfold = StratifiedKFold(n_splits=10, shuffle=True,
                        random_state=random_seed)

results = cross_val_score(kc_model, X_train, original_y_train,
                          cv=kfold, verbose=0)

print('results = {}\nresults.mean = {}'.format(
    results, results.mean()))


Listing 24.11: Doing cross-validation with our MNIST data. 


This is a lot of code, but it’s just assembling pieces we’ve already seen. 
Running this code gives us the output shown in Listing 24.12. 


results = [ 0.95221445  0.95019157  0.95617397  0.9525
  0.95116667  0.96166028  0.95999333  0.95415903
  0.95797899  0.95346898]

results.mean = 0.9549507265070032


Listing 24.12: Our first results running cross-validation on our simple 
Keras model. 


So the cross-validation run is telling us that on the original dataset of 
60,000 images, we got a performance of a bit more than 95% accuracy. 
That’s just about the same as what we saw graphically for this 
model way back in Figure 24.13, where the validation accuracy was 
just a smidge better than 95%. 


That’s reassuring. It says that this whole wrapping and cross-validat- 
ing scheme is producing the same results that we got when we trained 
and tested the model ourselves. 


Although cross-validation didn’t gain us anything in this example, 
since we already had a great validation set, now we know how to 
evaluate a model if we don’t have such a set handy. 


24.3.3 Cross-Validation with Normalization 


Earlier we said that something was missing from this process. The thing 
we left out was normalizing the data before each run of cross-validation. 


We got away with it in this case because we already normalized the 
training data to the range [0,1] when we divided it by 255. So when 
cross-validation grabs a random 90% of these samples and trains on 
them, it’s likely to get samples that run from 0 to 1. 


But that’s only because we’ve already normalized our data and things 
are very simple. In general, the data that’s going to get chosen from 
our database and used for cross-validation won’t be normalized to the 
range [0,1]. It’s up to us to get that normalization in there, and then 
apply that same transform to the part of the data that was set aside for 
testing in that run. 


Happily, that’s easy. We just build a pipeline. 


As we saw in Chapter 15, we can normalize the particular piece of train- 
ing data that’s built for each pass through cross-validation by building 
a Pipeline object composed of two steps: a normalizer followed by 
our model. 


Let’s do this by first making our objects, and then assembling them 
into a Pipeline object. For demonstration purposes our pipeline will 
contain a MinMaxScaler from scikit-learn, followed by our model. The 
MinMaxScaler is attractive because there are no parameters to set or 
options to pick. MinMaxScaler isn’t a perfect choice for this case, since 
it adjusts each pixel independently, which could lead to bright or dark 
spots. But it should be okay for demonstration on this data. Listing 
24.13 shows how to build the pipeline. 


from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

estimators = []
estimators.append(('normalize_step', MinMaxScaler()))
estimators.append(('model_step', kc_model))
pipeline = Pipeline(estimators)


Listing 24.13: Making a two-step pipeline with named components. 


Constructing a pipeline this way is useful when we want to later refer 
to the individual steps. We'll need to do that soon when we use grid 
searching. 


But for this cross-validation step, we don’t need that kind of access. 
We'll often see code that builds the pipeline in one line, using the short- 
cut make_pipeline() function. Listing 24.14 shows the step. 


from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(MinMaxScaler(), kc_model)


Listing 24.14: Making a pipeline using the shorthand notation, without 
names for the individual steps. 


These two pipeline objects behave the same way. The only difference is 
that we’ve chosen our own names for the steps in the first version, while 
make_pipeline() generates names automatically from the class of each step. 
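
A small sketch can make the naming difference concrete. The KNeighborsClassifier here is only a stand-in for our wrapped Keras model, chosen as an assumption so the example runs without Keras.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Stand-in for kc_model (an assumption made for this illustration).
stand_in_model = KNeighborsClassifier()

# Version 1: we pick the step names ourselves.
named = Pipeline([('normalize_step', MinMaxScaler()),
                  ('model_step', stand_in_model)])

# Version 2: the shortcut, which derives step names from the class names.
auto = make_pipeline(MinMaxScaler(), stand_in_model)

named_steps = [name for name, _ in named.steps]
auto_steps = [name for name, _ in auto.steps]
print(named_steps)
print(auto_steps)
```

The shortcut version uses the lowercased class name of each object as its step name, so with a different model in the second slot the generated name would change as well.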


To use our pipeline object, we just give it to cross_val_score() in 
place of a model (or wrapped model). Scikit-learn will recognize that 
it’s a pipeline and take care of all the rest. So each time through the 
loop, cross_val_score() will select one of the folds as a validation set. 
The rest of the data will be the training set for that run. It will give 
that training data to the MinMaxScaler() (using all the default argu- 
ments). Once the data has been analyzed, the transformation found 
by the MinMaxScaler will be applied to both the current training data 
and validation data. The model will then learn from the training data. 
When training is done, the system will run the transformed validation 
set through the model, predict its categories, compare those to the 
labels, and compute error scores. 
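
The steps just described can be sketched by hand, which is a useful way to check our understanding. This is only an illustration under assumptions: it uses scikit-learn's small digits dataset and a GaussianNB classifier as stand-ins for MNIST and our Keras model, and it scores with plain accuracy.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_digits(return_X_y=True)   # stand-in data
stand_in_model = GaussianNB()         # stand-in model
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

manual_scores = []
for train_idx, val_idx in kfold.split(X, y):
    # Learn the scaling from this fold's training data only
    scaler = MinMaxScaler()
    X_train_fold = scaler.fit_transform(X[train_idx])
    # Apply that same transformation to the held-out validation data
    X_val_fold = scaler.transform(X[val_idx])
    # Train a fresh copy of the model, then score it on the validation data
    model = clone(stand_in_model)
    model.fit(X_train_fold, y[train_idx])
    manual_scores.append(model.score(X_val_fold, y[val_idx]))
manual_scores = np.array(manual_scores)

# The same work, done by handing a pipeline to cross_val_score()
pipeline = make_pipeline(MinMaxScaler(), GaussianNB())
pipeline_scores = cross_val_score(pipeline, X, y, cv=kfold)

print(manual_scores.mean(), pipeline_scores.mean())
```

The key detail is that the scaler never sees the validation samples while it's being fit, so no information leaks from the validation fold into training.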


That’s a huge amount of work, all from one function call! That call is in 
Listing 24.15, where we simply replace kc_model in Listing 24.11 with 
pipeline. 


results = cross_val_score(pipeline, X_train, original_y_train, 
cv=kfold, verbose=0) 


Listing 24.15: We cross-validate with our pipeline just as we do for a 
model. 


Putting these new lines together, we get a new block of code that 
replaces the last few lines at the end of Listing 24.11. The new code, 
and the output we get from running the process with it, is shown in 
Listing 24.16. 


pipeline = make_pipeline(MinMaxScaler(), kc_model)
kfold = StratifiedKFold(n_splits=10, shuffle=True,
                        random_state=random_seed)
results = cross_val_score(pipeline, X_train, original_y_train,
                          cv=kfold, verbose=2)
print('results = {}\nresults.mean = {}'.format(
    results, results.mean()))


results = [ 0.95454545  0.9508579   0.95534078  0.952
  0.95283333  0.963994    0.95749292  0.95415903
  0.95781224  0.95380253]

results.mean = 0.9552838183370085


Listing 24.16: Setting up and calling the cross-validator with our pipeline. 


This value of about 0.955 matches our previous average accuracy. 


In this case, the extra work of building the pipeline with the 
normalizer didn’t pay off with any new benefits or accuracy. That’s 
probably because randomly removing a batch of samples from the training 
set and then normalizing had little effect on the samples, since 
they were already normalized. 


But that’s definitely not going to be true for all datasets, and we should 
never take it for granted. Unless we are certain about the input data 
and its statistics, using a pipeline and processing our data is usually 
worth the extra effort on our part. It takes little extra computing time 
to compute and apply most common transformations, compared to 
training and testing. 


We picked a MinMaxScaler here pretty much arbitrarily, but as we 
know, different data sets require different types of pre-processing. 
Using the pipeline mechanism, we can apply whatever steps we need. 


Cross-validation is a great way to get a handle on the quality of our 
model. It’s not so great when training times start to push our patience, 
since every fold is essentially a brand-new full-length training and 
testing process. Using 10 folds requires training and then testing our 
model 10 times in a row. 
The time required can add up fast. But if we don’t have a good 
validation set, then running cross-validation tests on small bits of our 
database, using different parameters, can teach us a lot about what’s 
going on in our data. That knowledge can in turn help us design an 
efficient larger network to process the whole database. 


24.3.4 Hyperparameter Searching 


This section’s notebook continues 
Keras-Notebook-10-scikit-learn.ipynb 


We've been using some arbitrarily hand-picked numbers in this chap- 
ter. For instance, we’ve settled on an architecture with 2 layers of 32 
neurons each, but not for any particular reason. 


Going forward, we'll often refer to “parameters” rather than the more 
awkward “hyperparameters.” As well as being more readable, this 
makes sense when we consider that many of these values are provided 
to the system as parameters, or arguments, of functions. 


We can use the grid searching algorithms offered by scikit-learn to 
help us out. With those routines, we can automatically try out all the 
different combinations of multiple settings for multiple parameters. 
We could do this ourselves with some nested loops, but it’s easier to 
relax and let scikit-learn do the driving. 


The grid searching object GridSearchCV will try out every combination 
of the parameters we give it, and measure each model’s performance 
using cross-validation. By default, it uses 3 folds to save time in the 
version of scikit-learn used here (releases from 0.22 onward default to 
5), but we can change that with the optional cv argument. 


We think of this as “searching” because we imagine that each combi- 
nation of parameters is a point in some very high-dimensional space, 
called the search space. Each point in search space represents some 
combination of parameters, and the value of that combination (that 
is, the accuracy or loss that results from training a model with those 
parameters) is the value associated with that point. The intuition is 
that we’re searching through this space, wandering from point to 
point and region to region, looking for the point that has the highest 
performance. 


Figure 24.16 shows an example of a search space with two dimensions. 
The idea really comes into its own when we're dealing with many more 
dimensions. Though we can’t visualize them, we can make the analogy 
to something like Figure 24.16 and talk about two sets of parameters 
being close or far apart, and even talk about regions of the space where 
it looks like we’re finding good results. 


[Figure 24.16 image omitted; one axis is labeled “parameter 2”] 


Figure 24.16: A two-dimensional search space. For each pairing of values 
of two parameters, the size of the circle shows the quality of the result, 
with larger circles better than smaller ones. It looks like there’s a small 
but high-quality region in the lower left, and two other promising regions 
in the upper right and upper left. Now that we know where the good 
results can be found, we can investigate those areas with a finer resolu- 
tion to find the best combination of parameters. 


As we discussed earlier, often we'll use just a subset of our training 
data when searching, so that it runs more quickly. When we’ve found 
the best values for our model’s parameters using this smaller database, 
we can use a larger version and work our way up to the full dataset. 
To keep things simple, for this discussion we'll continue to use MNIST, 
and the whole X_train database while searching. 


We'll want to use our normalizing pipeline again, since in general that 
will be necessary, or at least a very good idea. 


As we saw in Chapter 15, when we prepare a pipeline for grid 
searching, we need to tell GridSearchCV where each parameter it’s 
searching through ought to be routed. This means we need to identify 
different steps in the pipeline. That’s easy if we use the 
pipeline-building method of Listing 24.13, where we give a name to each step. 


As we saw in Chapter 15, referring to parameters inside pipelines is 
somewhat baroque. Let’s recap briefly. 


We build a dictionary, where each key is the name of a parameter to 
one of the steps in our pipeline, and its value is a list of all the val- 
ues we’d like to explore. Each name is formed by combining the step 
name in our pipeline and the name of the parameter with two under- 
score characters, as in step__parameter. With some typefaces the two 
underscores look like just one big underscore, which is unfortunate, 
but that’s how it is. For comparison, here is one_underscore and here 
are two__underscores. 
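
To see the double-underscore names that a pipeline actually accepts, we can ask the pipeline itself. The sketch below is an illustration under assumptions: a KNeighborsClassifier stands in for our wrapped Keras model, so the parameter names on the model step (such as n_neighbors) belong to that stand-in rather than to make_model().

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

pipeline = Pipeline([('normalize_step', MinMaxScaler()),
                     ('model_step', KNeighborsClassifier())])

# Every routable parameter is named step__parameter (two underscores).
routable = sorted(name for name in pipeline.get_params() if '__' in name)
for name in routable:
    print(name)

# Setting one of them routes the value to the matching step.
pipeline.set_params(model_step__n_neighbors=3)
print(pipeline.named_steps['model_step'].n_neighbors)
```

These are the same names we use as keys when we build a parameter dictionary for searching.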


Let’s build a dictionary to search through three of our model’s 
parameters: the number of dense layers (each with dropout), the number of 
neurons per dense layer, and, just for curiosity’s sake, two different 
optimizers. Listing 24.17 shows this dictionary. Note that because we’re 
building it with dict() and keyword arguments, the keys are written 
without quotes, even though they end up as strings. 


param_grid = dict(model_step__number_of_layers=[ 2, 3, 4 ],
                  model_step__neurons_per_layer=[ 20, 30, 40 ],
                  model_step__optimizer=[ 'adam', 'adadelta' ])


Listing 24.17: A dictionary of parameters that we’d like to use for searching. 
Each key is a parameter named by gluing together the name of the pipe- 
line step with the name of its parameter, with two underscores in between. 
Each value is a list of settings to be tried. 
We can use our dictionary with the pipeline we created above in Listing 
24.13 to create our searching object, as in Listing 24.18. 


grid_searcher = GridSearchCV(estimator=pipeline,
                             param_grid=param_grid, verbose=2)


Listing 24.18: Creating a GridSearchCV object that will go through our 
parameter grid, assemble a model for every combination of options, and 
cross-validate that model. 


Now we're ready to roll. We just call the searcher’s fit() routine with 
our data, and let it run. 


Listing 24.19 puts together the searching code. We’re going to suffix 
each variable with a 1 because we’re going to run another grid search 
below, which will be version 2. 


from sklearn.model_selection import GridSearchCV


param_grid1 = dict(model_step__number_of_layers=[ 2, 3, 4 ],
                   model_step__neurons_per_layer=[ 20, 30, 40 ],
                   model_step__optimizer=[ 'adam', 'adadelta' ])
grid_searcher1 = GridSearchCV(estimator=pipeline,
                              param_grid=param_grid1, verbose=2)
search_results1 = grid_searcher1.fit(X_train, original_y_train)


Listing 24.19: Combining the grid construction, GridSearchCV object 
construction, and then calling fit() to run the search. 


Warning! Grid searches are slow. 


The total number of full 3-fold cross-validation runs it will perform is 
reported by the searcher as soon as it starts up. Listing 24.20 shows 
what we'd see for the search we just defined. 


Fitting 3 folds for each of 18 candidates, totaling 54 fits 


Listing 24.20: As this output from the grid-search fit() shows, this 
exhaustive cross-validation will call fit() on our model 54 times. 
The number 54 comes from multiplying the number of folds (3 by 
default) by the number of combinations of variables. In our case, we 
have three lists of variables, with lengths 3, 3, and 2. The total 
number of possibilities is found by multiplying these together: 3×3×2=18. 
Since each possibility has to go through three steps of cross-validation, 
we will call fit() a total of 3×18=54 times. 
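
The arithmetic above is easy to reproduce with a few lines of plain Python. This is only a sketch of the counting, using the parameter lists from our search; it doesn't train anything.

```python
from itertools import product

# The three lists of values from our first search
param_lists = {
    'number_of_layers': [2, 3, 4],
    'neurons_per_layer': [20, 30, 40],
    'optimizer': ['adam', 'adadelta'],
}

# Every combination of one value from each list
combinations = list(product(*param_lists.values()))

number_of_folds = 3
total_fits = number_of_folds * len(combinations)

print('{} candidates, {} fits'.format(len(combinations), total_fits))
```

Adding one more list of, say, 4 values would multiply both numbers by 4, which is why grid searches grow expensive so quickly. Note also that with its default settings GridSearchCV trains the best candidate one more time on all the data once the search is finished.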


If it takes a minute to train and evaluate the model, it will take about 
an hour to run this search. 


The variable search_results1 we get back contains a lot of 
information. One of the objects in search_results1 is a dictionary called 
cv_results_ (recall that scikit-learn adds a trailing underscore to the 
names of attributes that are computed during fitting). The cv_results_ 
dictionary contains detailed information on the cross-validation results. 


Since we’re interested in finding the best combination of 
parameters, two dictionary items are of particular interest. The 'params' 
item tells us which set of parameters corresponds to each score. The 
'mean_test_score' item tells us the average value that came out of 
the cross-validation for each set of parameters. 


Let’s look first at the 'params' entry, shown in Listing 24.21. 


search_results1.cv_results_['params']


{'neurons_per_layer': 20, 'number_of_layers': 2, 'optimizer': 'adam'},
{'neurons_per_layer': 20, 'number_of_layers': 2, 'optimizer': 'adadelta'},
{'neurons_per_layer': 20, 'number_of_layers': 3, 'optimizer': 'adam'},
{'neurons_per_layer': 20, 'number_of_layers': 3, 'optimizer': 'adadelta'},
... 10 lines manually deleted ...
{'neurons_per_layer': 40, 'number_of_layers': 3, 'optimizer': 'adam'},
{'neurons_per_layer': 40, 'number_of_layers': 3, 'optimizer': 'adadelta'},
{'neurons_per_layer': 40, 'number_of_layers': 4, 'optimizer': 'adam'},
{'neurons_per_layer': 40, 'number_of_layers': 4, 'optimizer': 'adadelta'}


Listing 24.21: The contents of search_results1.cv_results_['params']. 
The prefix model_step__ before each parameter has 
been removed to make the list fit and easier to read. We also removed 10 
lines in the middle of the output. 
We can see that the searching algorithm checked every combination, 
but it proceeded in a different order than in our param_grid1 
dictionary. The outer loop ran through the 3 values of 
neurons_per_layer, the loop nested inside of that ran through the 3 
values of number_of_layers, and finally the innermost loop ran through 
the 2 values of optimizer. This ordering comes from scikit-learn rather 
than from our dictionary: the searcher sorts the parameter names 
alphabetically before it builds the combinations, so the last name in 
sorted order changes fastest. 
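
We can look at the visiting order without running a search at all, using scikit-learn's ParameterGrid helper, which enumerates the combinations of a parameter dictionary. In the sketch below the model_step__ prefix is left off for readability, as in the listing above, and the dictionary is deliberately written in a different order than the one in which the candidates come out.

```python
from sklearn.model_selection import ParameterGrid

grid = {
    'number_of_layers': [2, 3, 4],
    'neurons_per_layer': [20, 30, 40],
    'optimizer': ['adam', 'adadelta'],
}

# ParameterGrid works through the parameter names in sorted order,
# with the last name changing fastest.
candidates = list(ParameterGrid(grid))

for index in (0, 1, 2, 12):
    print(index, candidates[index])
```

The first few candidates line up with the first rows of the listing above, so an index into the list of scores can be traced back to its settings.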


The numerical data that describes our cross-validation test perfor- 
mance is in mean_test_score, as shown in Listing 24.22. 


search_results1.cv_results_['mean_test_score']


array([ 0.92901667,  0.91761667,  0.93081667,  0.91146667,
        0.92288333,  0.90051667,  0.9472    ,  0.93373333,
        0.94561667,  0.93166667,  0.94333333,  0.92621667,
        0.95566667,  0.9424    ,  0.95376667,  0.94185   ,
        0.95413333,  0.9402    ])


Listing 24.22: The cross-validation scores for our search. 


Using NumPy’s utility argmax() we can find the index of the largest 
value in this list, and then extract the corresponding element from the 
'params' item, so we can see which set of parameters gave us the best 
score. Listing 24.23 shows this step. 


best_index1 = np.argmax(
    search_results1.cv_results_['mean_test_score'])

print('best set of parameters:\n index {}\n {}\n'.format(
    best_index1,
    search_results1.cv_results_['params'][best_index1]))


best set of parameters:
 index 12
 {'model_step__optimizer': 'adam',
  'model_step__neurons_per_layer': 40,
  'model_step__number_of_layers': 2}


Listing 24.23: A snippet of code that finds the best test score from our 
cross-validation results, and prints the parameters that gave us that score. 
The output was slightly reformatted to fit better. 


So our best combination used 2 layers, with 40 neurons per layer, and 
the Adam optimizer. But how much better was this than the other 
combinations? Let’s plot all the values of mean_test_score so we can see 
how every combination performed, as in Figure 24.17. 


Figure 24.17: The values of mean_test_score resulting from searching 
for the best cross-validation results for every combination of the 
parameters in our dictionary of Listing 24.19. The best accuracy came from 2 
layers of 40 neurons each, using the Adam optimizer. 


Guided by the labels, we can interpret this graph as having three major 
sections, one each for 20, 30, and 40 neurons. Within each section we 
have three pairs, one pair each for 2, 3, and 4 layers. Finally, each pair 
of values shows the performance for Adam and then Adadelta. 


Each of these innermost pairs slopes downwards, so we can say that 
Adam consistently outperformed Adadelta. 


The general trend in each second-level group is also downwards, so 
adding more layers usually caused a loss of performance. 


The general trend in the largest groups is going upwards, suggesting 
that more neurons are better than fewer. 
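
These three observations can also be checked numerically, without the plot. The sketch below copies the scores from Listing 24.22 and relies on the ordering shown in Listing 24.21 (neurons in the outer loop, then layers, then optimizer); the variable names are our own choices for this illustration.

```python
import numpy as np

# Scores copied from Listing 24.22, in the order of Listing 24.21
mean_test_score = np.array([
    0.92901667, 0.91761667, 0.93081667, 0.91146667,
    0.92288333, 0.90051667, 0.9472,     0.93373333,
    0.94561667, 0.93166667, 0.94333333, 0.92621667,
    0.95566667, 0.9424,     0.95376667, 0.94185,
    0.95413333, 0.9402])

# axis 0: neurons (20, 30, 40)
# axis 1: layers (2, 3, 4)
# axis 2: optimizer (adam, adadelta)
scores = mean_test_score.reshape(3, 3, 2)

# Positive everywhere if Adam beat Adadelta in every pair
adam_minus_adadelta = scores[:, :, 0] - scores[:, :, 1]

# Average score for each neuron count, and for each layer count
mean_by_neurons = scores.mean(axis=(1, 2))
mean_by_layers = scores.mean(axis=(0, 2))

print(adam_minus_adadelta)
print(mean_by_neurons)
print(mean_by_layers)
```

Averaging over the other two parameters is a blunt summary, but it's a quick way to confirm the direction of each trend before deciding where to search next.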
As Listing 24.23 reports, the best set of parameters used 2 layers of 40 
neurons each, optimized with Adam. 


It’s interesting to note that the worst performance by far was the result 
of 4 dense-dropout layers of 20 neurons each. So that’s a structure to 
avoid for this data. 


Keep in mind that we’re always referring to “layers” as our combina- 
tion of dense and dropout layers, using our default dropout rate of 0.2. 


Let’s search around this area of the parameter space. Since it seems 
fewer layers are working better than more, let’s try searching for mod- 
els with 1 or 2 layers. And since more neurons are performing better 
than fewer, we'll try going up from 40 and look at some larger values. 


We could explore other optimizers, but this time around we'll stick 
with Adam. 


There’s no hard and fast rule for making these choices. We need to use 
our judgment based on our knowledge of our model and data, coupled 
with the results of our experiments, to guide our search strategy. If 
we search with too fine a grid we can waste a lot of time, but if we use 
too coarse a grid we could miss a big spike in performance. Generally 
speaking, searching for performance is a task that rewards both intu- 
ition and analysis. 


Listing 24.24 shows the dictionary for our second search. 


param_grid2 = dict(model_step__number_of_layers=[ 1, 2], 
model_step__neurons_per_layer=[ 
50, 80, 110, 140, 170 ]) 


Listing 24.24: The dictionary for our second parameter search. 


The results are plotted in Figure 24.18. 


Figure 24.18: Searching 1 and 2 layer networks for layers with increasing 
numbers of neurons. The results alternate with the 1-layer version, then 
the 2-layer. 


This tells us that more neurons keep working better, suggesting that 
our hunch was right. And 2-layer models are looking consistently 
better than 1-layer. 


Let’s crank up both the number of neurons and the search range quite 
a bit. Listing 24.25 shows the dictionary for our third search. 


param_grid3 = dict(model_step__number_of_layers=[ 1, 2 ],
                   model_step__neurons_per_layer=[
                       180, 280, 380, 480, 580 ])


Listing 24.25: The dictionary for our third parameter search. 


The results are in Figure 24.19. 


Figure 24.19: Searching 1 and 2 layer networks for layers with large 
numbers of neurons. The results alternate with the 1-layer version, then 
the 2-layer. 


Figure 24.19 shows us that the best performance came from 2 layers of 
280 neurons each. As we added more neurons, performance started to 
decline, though slowly. Perhaps this is due to overfitting, though we’d 
have to look more closely to be sure. 


Notice the size of the vertical scale. Our first graph, in Figure 24.17, 
showed an improvement of about 0.055 from the worst performer to 
the best, while in our most recent graph, Figure 24.19, the difference is 
only about 0.0035, which is about 1/15 the size. 


The overall feeling of the curve, though it’s jumping around when we're 
zoomed in so much, seems to be flattening out. We might be pretty 
close to the best choice for this set of parameters. 


We'll stop searching here, but we could continue to try different 
values for all of these parameters, or some we didn’t even try (like any 
of the normalizer’s parameters, or dropout_ratio in our 
own model). 


The time required to do a full grid search can quickly become 
impractical, because the searcher exhaustively tests every possible combination 
of the search parameters. So it’s always a good idea to use the smallest 
number of combinations we can get away with, and to search with the 
smallest amount of data that will give us a reasonable prediction of the 
model’s performance. 


A good strategy is to start with searches that cover broad ranges with 
just a few values. When we see where the model is performing best, 
we can then run another, denser search to explore the area around 
that zone. This is called multiresolution searching, and it’s just an 
algorithmic version of what we do when we look for something in the 
real world. Say we’re looking for a book in the library. We have a call 
number, so we wander around to the right section of the library, then 
the right stack, the right shelf, and so on, using a series of ever-nar- 
rower searches until we find our book. 


We do the same thing with ever-smaller searches until we find a com- 
bination of parameter values that work the best. 
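
Here is a minimal sketch of the multiresolution idea for a single parameter. Everything in it is hypothetical: toy_score() is a made-up stand-in for a cross-validated score (we've placed its peak at 280 neurons purely for illustration), and multiresolution_search() is our own helper, not a scikit-learn routine.

```python
def toy_score(neurons):
    """Hypothetical stand-in for a cross-validation score; peaks at 280."""
    return 1.0 - ((neurons - 280) / 1000.0) ** 2

def multiresolution_search(score_fn, low, high, points=5, rounds=4):
    """Search [low, high] with a coarse grid, then zoom in around the winner."""
    best = None
    for _ in range(rounds):
        step = (high - low) / (points - 1)
        candidates = [low + i * step for i in range(points)]
        best = max(candidates, key=score_fn)
        # Narrow the range to one grid step on either side of the best point
        low, high = best - step, best + step
    return best

best_neurons = multiresolution_search(toy_score, 0, 1000)
print(best_neurons)
```

Each round evaluates only 5 candidates, so four rounds cost 20 evaluations, while the spacing between candidates shrinks from 250 down to about 31. A real search would need to round to whole numbers of neurons, and it could be misled if the true score has several separated peaks.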


A useful alternative to the exhaustive search performed by the grid 
searcher is provided by scikit-learn’s RandomizedSearchCV algorithm. 
As discussed in Chapter 15, this variation on the grid search picks an 
unexplored random combination of the search parameters on each 
run. We could for instance just search a third of the total combina- 
tions. We’d get our answer back 3 times sooner than a complete grid 
search, but it would be incomplete. It would be incomplete in a nice 
way, though, giving us a roughly equal scattering of points throughout 
the parameter space. This might be enough to guide our choice for a 
smaller, more focused search. 
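
To see what "a third of the combinations" looks like, we can use scikit-learn's ParameterSampler helper, which draws random combinations from a parameter dictionary. This sketch only lists the combinations that would be tried; it doesn't train any models, and the random_state value is an arbitrary choice for repeatability.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

grid = {
    'number_of_layers': [2, 3, 4],
    'neurons_per_layer': [20, 30, 40],
    'optimizer': ['adam', 'adadelta'],
}

full = list(ParameterGrid(grid))

# Draw a third of the combinations at random. When every parameter is
# given as a list, the sampler draws without replacement, so no
# combination is repeated.
sampled = list(ParameterSampler(grid, n_iter=len(full) // 3,
                                random_state=42))

for candidate in sampled:
    print(candidate)
```

With RandomizedSearchCV the same idea is expressed through its n_iter argument, which sets how many combinations get cross-validated.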


24.4 Convolution Networks 


This section’s notebook is 
Keras-Notebook-11-CNN.ipynb 


Let’s build some convolutional neural networks, also called con- 
vnets, or more commonly, CNNs. 


In Chapter 21 we surveyed some of the remarkable things we can do 
with convnets. Recall that each convolution layer holds a collection of 
filters, or kernels, which are rectangles of numbers (often a small 
square that is 3, 5, or 7 elements on a side). When we use 2D convolu- 
tion layers with images, each filter in the first layer is applied in turn 
to every pixel in the input. The output of the filter becomes the value 
of that element in a new tensor produced at the layer’s output. If there 
are multiple filters, then the output tensor contains multiple chan- 
nels, just like the red, green, and blue channels of a color image. 
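
The way multiple filters produce multiple channels can be sketched in a few lines of NumPy. This is a slow, illustrative version under stated assumptions, not how Keras implements it: the function name convolve_layer() is our own, the input is a single-channel image of random values, and we pad with zeros so the output keeps the input's width and height.

```python
import numpy as np

def convolve_layer(image, filters):
    """Apply each 2D filter at every element of a 2D input.

    Returns a tensor with one channel per filter. As in most deep
    learning libraries, the filter is not flipped before it's applied.
    """
    height, width = image.shape
    num_filters, filter_h, filter_w = filters.shape
    pad_h, pad_w = filter_h // 2, filter_w // 2
    # Zero padding so the filter can be centered on edge elements
    padded = np.pad(image, ((pad_h, pad_h), (pad_w, pad_w)))
    output = np.zeros((height, width, num_filters))
    for f in range(num_filters):
        for row in range(height):
            for col in range(width):
                patch = padded[row:row + filter_h, col:col + filter_w]
                output[row, col, f] = np.sum(patch * filters[f])
    return output

rng = np.random.default_rng(42)
image = rng.random((28, 28))                # one 28 by 28 input
filters = rng.standard_normal((32, 3, 3))   # 32 filters, each 3 by 3

output = convolve_layer(image, filters)
print(output.shape)
```

The output has the same width and height as the input, but 32 channels, one for each filter.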


Although the original input to the first convolution layer of a model is 
often an image, we generally don’t think of the data that flows through 
the network that way. If a layer has 32 filters, for example, then the 
output will have 32 channels. It may have the same width and height 
as the input (though we'll see those measures often change as well), 
but it’s not really an “image” any more. So while it’s tempting to casu- 
ally speak about convolution layers as processing “images” made of 
“pixels,” it’s a better idea to refer to them as tensors, made of elements. 


Recall from Chapter 21 that we can characterize a convolution layer by 
the number of dimensions in which the filters are moved. If the filter is 
moved in just one dimension (for example, down), then we call it a 1D 
convolution layer. Typically, when we work with images, we slide our 
filters over the 2D width and height of the tensor, so we usually use 2D 
convolution layers for image processing. Keras also offers 3D convolu- 
tion layers for working with volumetric data. 


In this chapter, we'll focus on images, so we'll be using 2D convolution 
layers. 


In the following sections we'll see how to build and train our own 
CNNs. As we'll discuss later, in practice we don’t often build and train 
a new CNN from the ground up. Instead, we usually try to start with 
an existing network whenever possible, and specialize it for our task 
by perhaps modifying it, and then training it some more with our own 
data. Such transfer learning is appealing because we get to start 
with an existing architecture that is known to work well, and we save 
the time (sometimes days or weeks) that was invested in training the 
model we’re building upon. We also get the benefit of the data that 
network was trained on, which might not be available to us. 


But it’s important to know how to build our own from scratch. This 
lets us start fresh when we need to, and gives us the tools to modify an 
existing network when we want to. Whether we're working with our 
own model or one we’ve adopted, knowing what’s going on inside will 
help us diagnose problems and get the best performance out of our 
model. 


Let’s begin by setting up a few basic ideas that will be helpful when we 
build our convnets. 


24.4.1 Utility Layers 


We'll briefly recap some of the utility layers that we saw in Chapter 20, 
focusing on those that are useful in convnets. 


Like the dropout layer, these aren’t fully-fledged computational lay- 
ers. Instead, they’re often “informational” layers that tell Keras how 
to process or manipulate data that flows through the layer, or how to 
affect a previous layer. Keras structures these as layers so that we can 
think of our model consistently as a stack of layers. 


Most of the layers described below are available in 1-, 2-, and 
3-dimensional forms. When working with images, we almost always use the 2D 
versions, so that’s all we'll cover here. 


A recap of our schematic symbols for the major types of layers is shown 
in Figure 24.20. 


[Figure 24.20 image omitted; the symbols are labeled as follows] 

Dense   Convolution   Recurrent   Sequences   Flatten   Reshape 

Dropout   Batchnorm   Noise   Zero Pad   Average Pool   Max Pool   Upsample 


Figure 24.20: A repeat of a figure from Chapter 20, showing the schematic 
symbols for different layers. 


A flatten layer takes a tensor of any number of dimensions and lines 
up all of its contents into a single one-dimensional list. It always does 
this in the same order, so we can predict where each element in the 
tensor will appear in that list (we usually don’t care in what order 
the elements are listed, as long as it’s consistent from one sample to 
the next). Keras calls this layer Flatten. 


A pooling layer looks at elements that make up a block in the input, 
calculates a single value from them, and saves that single value to the 
output in place of all the input elements. The most common use of 
pooling is to reduce the size of its input. For example, if the blocks 
are 2 by 2, and they don’t overlap, the output will be half the width 
and height of the input. Keras offers two types of pooling layers. The 
MaxPooling2D layer finds the largest value in each block. We can tell
it the size of the block to use, and the stride, or how many elements 
to move horizontally and vertically after each block. Commonly we 
use blocks that are 2 by 2 with a stride of 2 in each dimension. The 
AveragePooling2D layer works the same way, but instead computes 
the average value of each block. 
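The two kinds of pooling can be sketched in plain NumPy. This is not how Keras implements them; the helper name pool_2x2 and the sample grid are hypothetical, and the sketch assumes 2 by 2 blocks, a stride of 2, and an input whose height and width are even.

```python
import numpy as np

def pool_2x2(grid, reduce_fn):
    """Apply reduce_fn to non-overlapping 2 by 2 blocks of a 2D grid
    whose height and width are both even."""
    h, w = grid.shape
    # blocks[i, a, j, b] is grid[2*i + a, 2*j + b]
    blocks = grid.reshape(h // 2, 2, w // 2, 2)
    return reduce_fn(blocks, axis=(1, 3))

grid = np.array([[1, 2, 5, 6],
                 [3, 4, 7, 8],
                 [9, 8, 1, 0],
                 [7, 6, 3, 2]], dtype=float)

max_pooled = pool_2x2(grid, np.max)    # [[4, 8], [9, 3]]
avg_pooled = pool_2x2(grid, np.mean)   # [[2.5, 6.5], [7.5, 1.5]]
```

The 4 by 4 input becomes a 2 by 2 output in both cases, which is the halving of width and height described above.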


As we discussed in Chapter 21, pooling layers after convolution lay- 
ers are falling out of favor, replaced by striding within the convolution 
layer itself to achieve a similar result. The striding and pooling
approaches produce similar but different results. Usually we don’t care 
about that difference, but sometimes it matters, so pooling layers are 
still sometimes used. 


A cropping layer removes a tensor’s outermost elements, leaving just 
the inner rectangle. The Keras layer called Cropping2D takes argu- 
ments which let us describe how many elements to remove from each 
of the four sides. 


An upsampling layer is designed to make the input tensor larger. 
Each element is just repeated horizontally and vertically by the given 
number of times. Keras calls this layer UpSampling2D (note the upper- 
case S in the middle of the name). 
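Upsampling by repetition is easy to picture with a tiny example. The sketch below uses NumPy's repeat() on a made-up 2 by 2 grid, with a repetition factor of 2 in each direction chosen just for illustration.

```python
import numpy as np

grid = np.array([[1, 2],
                 [3, 4]])

# Repeat every element twice vertically (axis 0),
# then twice horizontally (axis 1).
upsampled = np.repeat(np.repeat(grid, 2, axis=0), 2, axis=1)

print(upsampled)
# [[1 1 2 2]
#  [1 1 2 2]
#  [3 3 4 4]
#  [3 3 4 4]]
```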


As we mentioned in Chapter 21, an alternative to an explicit upsam- 
pling layer after a convolution layer is to use transposed convolution 
(or fractional striding) in the convolution layer itself. Like normal 
striding and max pooling, transposed convolution produces a similar 
but different result compared to upsampling. 


A batchnorm layer normalizes each batch of data flowing through it,
adjusting the values so that they have an average of 0 and a standard
deviation of 1. This has a regularizing effect and helps keep weights
from growing too large.


A noise layer adds some random noise to every element in the ten- 
sor. This is rarely used, but can be helpful if some neurons seem to be 
overly-aggressive in matching specific features that are not ultimately 
important. 




Finally, a zero padding layer places 0’s around the perimeter of
the input. This is usually so that convolution kernels will not “fall 
off the edge” and try to access non-existent data. Keras calls this 
the ZeroPadding2D layer (note the upper-case P). Because Keras 
now offers this feature in the convolution layers themselves, explicit 
zero-padding before convolution is rare in Keras 2 models. 
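As a rough picture of what zero padding does to a single 2D grid, here is a NumPy sketch (the grid is made up, and a ring one element wide is an arbitrary choice). Slicing the padded grid back down also illustrates the cropping operation described earlier.

```python
import numpy as np

grid = np.array([[1, 2],
                 [3, 4]])

# Surround the grid with a ring of 0's, one element wide.
padded = np.pad(grid, pad_width=1, mode='constant', constant_values=0)

print(padded)
# [[0 0 0 0]
#  [0 1 2 0]
#  [0 3 4 0]
#  [0 0 0 0]]

# Cropping one element from each side undoes the padding.
cropped = padded[1:-1, 1:-1]
```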


24.4.2 Preparing the Data for a CNN


We'll continue to use MNIST for our example data set. 


We prepare our MNIST data for a convnet with almost the same pro- 
cess as we’ve been doing so far when the first layer was a Dense, or 
fully-connected, layer. 


The difference is in the shape of the feature data. So far, we’ve been 
shaping our feature data in a 2D grid, with one row per image. Each 
row held all the pixels for that image. 


Something important happened when we flattened out our image 
to make that grid: we lost the spatial information that tells us which 
pixels are near one another vertically (technically, it’s still there, but 
definitely not in a structure that’s easily useful). A great thing about 
CNNs is that they work with inputs as multidimensional tensors, not 
long 1D lists. For instance, the receptive field for a filter covers a group 
of spatially-related elements. 


When working with CNNs there’s no need to flatten out input 2D grids 
of pixels. We'll maintain them instead as three-dimensional volumes, 
where each input image has a height, width, and depth. 


One important use of this third dimension is to bundle together the 
channels of data that represent color images. A typical digital color 
image has three channels, one each for red, green, and blue. So if we 
stack these up, we'll get a block of 3 layers. On the other hand, a picture 
that has been prepared for printing usually has four channels: cyan,
magenta, yellow, and black. This requires a block of 4 layers. Figure 
24.21 shows this idea. 





Figure 24.21: Multiple 2D grids of pixels are often used to represent 
richer types of images. (a) A typical digital image in color uses three chan- 
nels, one each for red, green, and blue. (b) A file prepared for printing may 
have four channels, for cyan, magenta, yellow, and black. 


The MNIST data is black and white, so we have just a single channel of 
pixel data. But we still have to explicitly tell Keras that we have just that 
one channel, by making it one of the dimensions of our input tensor. 


The order in which we name our dimensions depends on whether we’re 
using the channels_first or channels_last option, as we discussed 
at the start of Chapter 23. We'll continue to use channels_last here, 
stacking up our images from front to back as in Figure 24.21. With this
option, each image's dimensions are listed in the order down, then right,
then away (the channel).


By adding a channel dimension, each MNIST image will become a 3D 
block with dimensions 28 by 28 by 1. Our input data structure will
contain 60,000 of these 3D blocks. That means the complete tensor 
will have a first dimension of 60,000, followed by the shape of each 
image. 


Using the channels_last convention, this tensor will have dimen- 
sions of 60,000 by 28 by 28 by 1. 


We can’t draw a 4D tensor, but we can show lots of 3D tensors in a list. 
Figure 24.22 uses that approach to picture our dataset. 


Figure 24.22: Each image in the input will be reshaped as a 3D block
with dimensions 28 by 28 by 1, since we're using the channels_last
convention. Then we stack all 60,000 of these 3D blocks together to
make a 4D tensor of shape 60,000 by 28 by 28 by 1, which will serve as
input to our CNN.


As we discussed earlier, it can be easier to think of this not as a 4D
structure but instead as a sequence of nested lists: the outermost list
contains 60,000 images, each image contains 28 rows, each row contains
28 pixels, and each pixel contains one channel value.


This means that each pixel value is named with four numbers: the image
number, y position, x position, and channel number, in that order.


Convnets work best with input data scaled from -1 to 1 [Karpathy16b].
This means we can’t just divide every pixel by 255. Instead, we'll use
the NumPy function interp() to convert each input value in the range
[0,255] to the range [-1,1], shown in Listing 24.26.


X_train = np.interp(X_train, [0, 255], [-1,1]) 
X_test = np.interp(X_test, [0, 255], [-1,1]) 


Listing 24.26: Using NumPy’s interp() routine to convert all the input 
values from [0,255] to [-1,1]. 
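To see what this scaling does, here is a small check on a few made-up pixel values. With these two endpoints, interp() is just a straight-line mapping, so it should agree with dividing by 127.5 and subtracting 1 (the variable names below are hypothetical).

```python
import numpy as np

pixels = np.array([0, 64, 128, 255], dtype=float)

# The mapping used in the listing above.
scaled = np.interp(pixels, [0, 255], [-1, 1])

# The same straight-line mapping written out by hand.
by_hand = pixels / 127.5 - 1.0

print(np.allclose(scaled, by_hand))   # True
```

Black (0) lands at -1, white (255) lands at 1, and everything else falls proportionally in between.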


Now we'll re-shape the data into the shape we just discussed. We just
tell NumPy how to take our original version of X_train, which was
60,000 by 28 by 28, and reshape it into a 4D tensor that’s 60,000 by
28 by 28 by 1. We’re not changing the total number of elements, so
NumPy can do this for us. We just hand it the dimensions we want it
to use. Listing 24.27 shows the code.


# reshape sample data to 4D tensor using channels_last convention
X_train = X_train.reshape(X_train.shape[0],
                          image_height, image_width, 1)
X_test = X_test.reshape(X_test.shape[0],
                        image_height, image_width, 1)


Listing 24.27: How to transform our input MNIST data into the 4D tensor 
that our CNN expects. 
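The effect of this reshape can be checked on a small stand-in array. The sketch below assumes 5 made-up images instead of the real 60,000, and marks one pixel so we can find it again after the channel dimension has been added.

```python
import numpy as np

# Stand-in for MNIST: 5 made-up images, each 28 by 28.
images = np.zeros((5, 28, 28))
images[3, 10, 20] = 0.5    # image 3, row 10, column 20

# Add a trailing channel dimension (channels_last).
reshaped = images.reshape(images.shape[0], 28, 28, 1)

print(reshaped.shape)                 # (5, 28, 28, 1)
print(reshaped[3, 10, 20, 0])         # 0.5
print(images.size == reshaped.size)   # True
```

The marked pixel keeps its image, row, and column indices, and simply gains a fourth index of 0 for its single channel.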


We'll place these re-shaping lines right after the scaling step. For com- 
pleteness, Listing 24.28 shows all the pre-processing in one place. 
This includes all the import statements we'll be needing going for- 
ward. Aside from the import statements, and the final re-shaping, this 
pre-processing is identical to what we’ve been doing so far. A CNN is, 
after all, just another deep neural network, but jazzed up with some 
new types of layers. 


We're assuming that the channels_last option has been selected 
in the Keras configuration file. If that’s not the case, either change 
the file, or import the backend and include a call to set the value of 
image_data_format, as we saw in Chapter 23.


from keras.datasets import mnist
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.convolutional import Conv2D, MaxPooling2D
from keras.constraints import maxnorm
from keras.optimizers import Adam, SGD, RMSprop
from keras import backend as keras_backend
from keras.utils import np_utils
from keras.preprocessing.image import ImageDataGenerator
from keras.utils.np_utils import to_categorical
import numpy as np

random_seed = 42
np.random.seed(random_seed)

# load MNIST data and save sizes
(X_train, y_train), (X_test, y_test) = mnist.load_data()
image_height = X_train.shape[1]
image_width = X_train.shape[2]
number_of_pixels = image_height * image_width

# convert to floating-point
X_train = keras_backend.cast_to_floatx(X_train)
X_test = keras_backend.cast_to_floatx(X_test)

# scale data to range [-1, 1]
X_train = np.interp(X_train, [0, 255], [-1, 1])
X_test = np.interp(X_test, [0, 255], [-1, 1])

# save original y_train and y_test
original_y_train = y_train
original_y_test = y_test

# replace label data with one-hot encoded versions
number_of_classes = 1 + max(np.append(y_train, y_test))
y_train = to_categorical(y_train, num_classes=number_of_classes)
y_test = to_categorical(y_test, num_classes=number_of_classes)

# reshape sample data to 4D tensor using channels_last convention
X_train = X_train.reshape(X_train.shape[0],
                          image_height, image_width, 1)
X_test = X_test.reshape(X_test.shape[0],
                        image_height, image_width, 1)


Listing 24.28: The pre-processing step for our CNNs to categorize MNIST 
data. 
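The one-hot step in this pre-processing can be pictured with a NumPy stand-in. This is only a sketch of the result that to_categorical() produces, not its implementation, and the three labels are made up.

```python
import numpy as np

labels = np.array([3, 0, 2])               # made-up integer labels
number_of_classes = 1 + int(labels.max())  # 4 classes: 0, 1, 2, 3

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(number_of_classes)[labels]

print(one_hot)
# [[0. 0. 0. 1.]
#  [1. 0. 0. 0.]
#  [0. 0. 1. 0.]]
```

Each row has a single 1 in the position of its label, which is the form that the softmax output layer is compared against during training.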


Shaping the feature data into these 4D tensors is a necessary
pre-processing step. It puts the data into the structure that is expected
by the convolution layer that will sit at the start of our convnet.


24.4.3 Convolution Layers


Let’s look more closely at how we define a convolution layer. We saw 
that Keras offers convolution layers in 1, 2, and 3 dimensions. We'll 
pick the 2D version, since that’s what we usually use for image data 
like our running example of MNIST digits. 


The layer has the name Conv2D, and we access it by importing it from 
the module keras.layers.convolutional.


The Conv2D layer takes two unnamed, mandatory arguments at the 
start of its argument list, followed by a variety of optional arguments. 


The first mandatory argument is an integer specifying the number of 
filters the layer should manage. Recall from Chapter 21 that each filter 
is applied to the input independently, and produces its own output. So 
if our input has one channel (as our input does), and we use 5 filters 
in a convolution layer, the output will have 5 channels, as shown in 
Figure 24.23. 


Figure 24.23: If we have a 1-channel input and 5 filters, the output tensor
will have 5 channels, one for the output of each filter. Here we show the 
operation of the 5 filters on a single element of the input, producing an 
output with 5 channels. 
































The second argument to Conv2D is a list that gives the dimensions of 
the filters on this layer. In Keras, as in many libraries, all of the filters 
in a given layer are of the same size, so we only specify one filter size 
for the entire layer. Continuing our example from before, if we have 
5 filters and each is 3 by 3, then these arguments would be 5, [3,3]. 
This tells the layer to automatically allocate and initialize 5 volumes, 
each of shape 3 by 3 by 1 (the trailing 1 is the number of channels). 


In practice, we almost always use square kernels, often of 3 or 5
elements on a side. Experience has shown that these sizes, coupled
with reduction in the size of the input (either by pooling or convolution
striding), represent a good tradeoff of computation and results.
Larger kernels are sometimes used, but they’re not as common.


Keep in mind that these filter kernels are 3D volumes, since there’s 
one channel in the kernel for each channel in the input. For example, 
suppose that our input image to the first layer is a color image, and 
that our layer is using 5 by 5 kernels. Since there are three channels, 
each filter is created as a block that is 5 by 5 by 3. This block is moved 
over the image in 2 dimensions (hence the name Conv2D), and at each 
element the 75 values from the input are multiplied with their corre- 
sponding 75 entries in the kernel, the results are all added together, 
and that’s the output of this kernel at this element, as shown in Figure 
24.24. 
















Figure 24.24: Each filter automatically holds as many channels as there 
are in the input. Here a 5 by 5 filter is being applied to a 3-channel input,
so the system automatically gives the filter 3 channels as well. Each of 
the 75 values in the input (bottom) is multiplied by its corresponding 
value in the filter (middle), and all of those products are added together 
to produce a single number (top), the output of that filter for that loca- 
tion of the input. 
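The multiply-and-add step for one filter at one location can be written in a couple of lines of NumPy. The values below are random stand-ins, and the sketch leaves out the bias term that a Keras convolution layer also adds by default.

```python
import numpy as np

rng = np.random.default_rng(42)
patch = rng.random((5, 5, 3))    # 5 by 5 region of a 3-channel input
kernel = rng.random((5, 5, 3))   # one filter, also 3 channels deep

# Multiply the 75 pairs of values and add up the products.
output_value = np.sum(patch * kernel)

print(patch.size)          # 75
print(output_value.shape)  # () -- a single number
```

Sliding the same kernel to every location of the input and repeating this calculation produces one channel of the layer's output.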


If we have several of these filters in a given convolution layer, then
we'll produce several outputs, just as in Figure 24.23, except now for a 
multi-channel input. Figure 24.25 shows the idea. 























Figure 24.25: If we apply several filters to a multi-channel input, then 
each filter will also have multiple channels. The number of channels in 
the output is given by the number of filters that were used. 


Keeping track of all of these shapes would be an administrative 
challenge, so Keras manages them for us. As a result, we can create 
sequences of convolution layers doing nothing more than telling Keras 
how many filters we want to use on each layer, and what their footprint 
should be. Keras keeps track of the number of channels coming from
the previous layer, and makes the filters of the necessary size, with no 
effort from us. 


For example, let’s suppose that we’ve made a convolution layer with 5 
filters, each 5 by 5. Then every output it produces will have 5 channels. 
If the next layer is also a convolution layer, and we say that we want 2 
filters that are 3 by 3, Keras will automatically know to make each filter 
5 channels deep, since that’s what’s coming out of the previous layer. 
In short, the number of channels in each filter is equal to the number 
of channels in the input, which in turn is the number of filters used in 
the previous convolution layer. Figure 24.26 shows this idea visually. 
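The bookkeeping that Keras does for us can be imitated in a few lines of plain Python. The helper kernel_shapes below is hypothetical (it is not part of Keras); it simply carries the channel count from one layer to the next, using the example of 5 filters followed by 2 filters.

```python
def kernel_shapes(input_channels, layers):
    """layers is a list of (number_of_filters, (height, width)) pairs.
    Returns the shape (height, width, channels) of one filter in each layer."""
    shapes = []
    channels = input_channels
    for number_of_filters, (height, width) in layers:
        shapes.append((height, width, channels))
        channels = number_of_filters   # becomes the next layer's input depth
    return shapes

# One-channel input, then 5 filters of 5 by 5, then 2 filters of 3 by 3.
shapes = kernel_shapes(1, [(5, (5, 5)), (2, (3, 3))])
print(shapes)   # [(5, 5, 1), (3, 3, 5)]
```

The second layer's filters come out 5 channels deep without our ever saying so, because the first layer has 5 filters.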


Figure 24.26: When one convolution filter follows another, the filters in
the second layer are automatically configured to have as many channels 
as there are channels in the preceding layer. 


This automatic sizing of filters is one of the great pleasures of using a
library such as Keras. We don’t have to manage any of this book-keep- 
ing ourselves, because the library already has everything it needs to 
know to always keep things straight. 


Let’s make a convolution layer with 15 filters, each 3 by 3. Listing 24.29 
shows the code for making the layer and placing it into a model. 


convolution_layer = Conv2D(15, (3, 3))
model.add(convolution_layer) 


Listing 24.29: How to make a convolution layer with 15 filters that are 
each 3 by 3. 


In practice, we usually do this in one step, as we did with other layers 
earlier in the chapter. The single line that’s usually used for this job is 
shown in Listing 24.30. 


model.add(Conv2D(15, (3, 3))) 


Listing 24.30: The more usual way to add a convolution layer to a model. 


The Conv2D layer accepts many optional arguments, all of which are 
described in the documentation. We'll discuss just the ones that we'll 
be needing. 


We'll start with the arguments that are unique to convolution layers, 
and then touch on the ones we’ve already seen when using Dense lay- 
ers. For convenience, we'll continue to refer to input tensor elements 
as “pixels,” though as we discussed earlier that doesn’t really make 
sense past the first layer. 


A nice optional feature of the convolution layer is that we can include 
zero-padding inside the layer itself if we want it, rather than building 
an explicit zero-padding layer into the model. Even better, Keras can 
automatically compute how much zero padding we need if we want 
our output to have the same shape as our input. It uses the sizes of 
the filters, and our striding choices if necessary, and adds enough 0’s
around the outside to guarantee that the filters will never “fall off the 
edge” of the input. 


To apply zero padding, we set the optional argument padding to the 
string 'same', meaning “make the output the same size as the input.” 
The default value of padding is the string 'valid', which means “only 
place the filter where there is valid data available.” This is a long- 
winded way of saying, “no padding.” 


We saw in Chapter 21 that striding allows us to move the filter by any 
distance on each step. We set the stride amounts with the optional 
parameter strides. This accepts a list of 2 numbers, giving the num- 
ber of pixels to move horizontally and vertically. This list defaults to 
(1,1). For example, if we set the stride values to (2,2), then our
output will be half the width and height of the input. Note that the
number of output channels is not affected by the stride, since that comes
from the number of filters.


As a little convenience, we can set strides to just a single number, 
and it will use that for both directions. So instead of the list (2,2), we 
can just give the single value 2. 
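The combined effect of padding and strides on the output size can be worked out with a little arithmetic. The function below is a hypothetical helper, not a Keras call; it assumes the sizing rules used by the TensorFlow backend, applied to one dimension at a time.

```python
import math

def conv_output_size(input_size, kernel_size, stride=1, padding='valid'):
    """Output length along one dimension of a convolution."""
    if padding == 'same':
        # enough zeros are added that only the stride matters
        return math.ceil(input_size / stride)
    if padding == 'valid':
        # the kernel must fit entirely inside the input
        return (input_size - kernel_size) // stride + 1
    raise ValueError("padding must be 'same' or 'valid'")

print(conv_output_size(28, 5, stride=1, padding='same'))    # 28
print(conv_output_size(28, 5, stride=2, padding='same'))    # 14
print(conv_output_size(28, 5, stride=1, padding='valid'))   # 24
print(conv_output_size(28, 5, stride=2, padding='valid'))   # 12
```

So 'same' padding keeps a 28-element side at 28 only when the stride is 1; with a stride of 2 the side is halved to 14.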


The last option we'll look at now is activation. Just like our previ- 
ous discussion of Dense layers, every value produced by a convolution 
layer goes through an activation function. The default is a linear func- 
tion, which in effect does nothing. We can set an activation function 
by providing its name as a string. All of the functions we discussed 
in Chapter 17 are available, plus several others (see the Keras docu- 
mentation). Common choices for hidden convolution layers are 'relu' 
and 'tanh'.


Recall that a Dense layer required an argument to input_shape if and 
only if that layer was the first layer in the network. Convolution layers 
work the same way, and require a value to be assigned to input_shape 
if they’re the very first layer. 


The input_shape argument we applied to Dense layers was a list with
a single value: the length of the list that represented an image. Just as 
with those layers, a CNN layer wants the input_shape to describe not 
the shape of the whole data set, but just one sample. We know that 
in our MNIST example we have 60,000 samples, each 28 by 28 by 1. 
Therefore, the value of input_shape is the list (28,28,1), describing 
one image. 


Now that we’ve covered all the background, let’s construct a 2D con- 
volution layer. This will be the first layer in our model, so we need the 
input_shape argument. Let’s say that we want 16 filters, each 5 by 5. 
For the sake of demonstration, we'll pick the relu activation function, 
zero-padding with padding='same', and a stride of 2 in both X and Y (so
the output is half the width and height of the input). Listing 24.31
shows how we’d add this to a model named model.


model.add(Conv2D(16, (5, 5), activation='relu',
                 strides=(2, 2), padding='same',
                 input_shape=(image_height, image_width, 1)))


Listing 24.31: Adding a 2D convolution layer to our model. The first 
argument is the number of filters, followed by the width and height of 
each filter. The other, named arguments (except for input_shape) are 
optional. Here, we set the activation function to relu, apply zero-padding
by setting padding to same, set the stride to (2,2), and specify the shape
of the input tensor using the channels_last convention.


After all that discussion, the mechanics end up being pretty short, even
with the optional arguments. 


That covers the basics of using convolution layers. Everything else is 
just like before: we create our model, add in layers, compile it, and 
then train it. Later we can ask it for predictions. 


Keras takes care of all the work of creating filter kernels of the right size, 
initializing them with good values, and improving them with backprop. 


Now that we know how to create a convolution layer, let’s build a
fully functional convnet.


24.4.4 Using Convolution for MNIST 


Let’s build a CNN for categorizing MNIST images. To start with, we'll 
make a simple convnet with just one convolution layer, a flattening 
layer, and a fully-connected output layer. 


Our pre-processing step is unchanged from what we saw in Listing 
24.28. It reads in our data, normalizes it, re-shapes the features to the 
4D channels_last tensor, and makes one-hot encodings for the labels.


Our model will have the architecture of Figure 24.27. 


[Figure: a convolution layer labeled 32 x (5x5), ReLU, same padding]


Figure 24.27: The architecture for a tiny convolutional neural network. 
We have one convolutional layer with 32 filters, each a square of 5 by 5 
elements. Following that is a flatten layer, and then a 10-neuron fully-con- 
nected layer to present our categorization outputs. 


To create our model, we start just as before, by creating a new object of 
type Sequential, then adding layers one at a time.


We'll start with a 2D convolution layer with 32 kernels, each 5 by 5. 
We'll use a relu activation function, and set padding='same', which 
will give the layer a temporary ring of 0-padding, so that the output
will have the same horizontal and vertical sizes as the input. Since the
Dense layer at the end takes a list as input, we'll use a Flatten layer
to turn the 28 by 28 by 32 tensor into a list of 28x28x32=25,088 elements.
We use softmax on that final layer, as before.


As usual, we'll wrap all of this up in a little function that creates the
model, compiles it, and returns the final result.


Listing 24.32 shows the code. 


def make_simple_cnn_model():
    model = Sequential()
    model.add(Conv2D(32, (5, 5),
                     activation='relu', padding='same',
                     input_shape=(image_height, image_width, 1)))
    model.add(Flatten())
    model.add(Dense(number_of_classes, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model


Listing 24.32: The code to make our first CNN for classifying MNIST 
digits. 
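It can be useful to estimate how many weights a model like this has to learn. The arithmetic below is a sketch under the assumption that both layers use a bias, which is the default in Keras; the variable names are only for illustration.

```python
image_height, image_width, channels = 28, 28, 1
number_of_filters, kernel_height, kernel_width = 32, 5, 5
number_of_classes = 10

# Each filter has one weight per kernel element per input channel, plus a bias.
conv_weights = (kernel_height * kernel_width * channels + 1) * number_of_filters

# With padding='same' and a stride of 1, the output keeps the input's
# width and height, and has one channel per filter.
flattened_size = image_height * image_width * number_of_filters

# Each output neuron has one weight per flattened element, plus a bias.
dense_weights = (flattened_size + 1) * number_of_classes

print(conv_weights)                   # 832
print(flattened_size)                 # 25088
print(dense_weights)                  # 250890
print(conv_weights + dense_weights)   # 251722
```

Almost all of the weights sit in the final dense layer, because the flattened output of the convolution layer is so large.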


As far as Keras is concerned, this is just a Sequential object like any 
other. We can train this model just as we always have, by calling fit() 
with all the necessary parameters. Just for completeness, Listing 24.33 
shows the code. Like our experiments in the last section, we'll run this 
model for 100 epochs, using a batch size of 256. 


simple_cnn_model = make_simple_cnn_model()
simple_cnn_history = simple_cnn_model.fit(X_train, y_train,
                                          validation_data=(X_test, y_test),
                                          epochs=100, batch_size=256)


Listing 24.33: To train our CNN, we only need to call its fit() method 
with the usual parameters. 


The results of this training session are shown in Figure 24.28.


Figure 24.28: The accuracy and loss for our simple convnet.


The numbers from the final epoch are shown in Listing 24.34. 


Epoch 100/100 
66s-loss: 2.7044e-04-acc: 1.0000-val_loss: 0.1016-val_acc: 0.9862


Listing 24.34: The final values from Figure 24.28. We removed some 
spaces to make the line fit. 


The good news about these results is that everything seems to be work- 
ing pretty well. Our system learns the training data and does a good 
job predicting the classes of the validation data, getting about 98.6% 
accuracy. 


On the other hand, these curves aren’t really looking great. The train- 
ing accuracy gets up to about 1.0 within 35 epochs or so, and the 
validation accuracy seems to plateau about there as well. That’s okay, 
but the loss curves tell a different story. The system starts overfitting 
before even 10 epochs are done, and it just gets worse as time goes on. 


As usual, to cure problems like this we need to follow our hunches. 
Let’s guess that maybe we would see better performance if we used a 
deeper model. We'll increase the number of convolution layers from 1 
to 3, and to control overfitting we'll add dropout after each one. 


Figure 24.29 shows our new, deeper architecture. 


Figure 24.29: A bigger CNN using multiple convolution and dropout
layers. 


The first convolution layer uses 16 filters of 5 by 5, and then we follow 
that with two layers that use 8 filters of 3 by 3. All of these numbers are 
more or less arbitrary, resulting from an initial guess and then some 
trial and error. We follow each convolution layer with a dropout layer, 
and at the end we flatten the result and feed it to a 10-neuron dense 
layer using softmax, as usual. 


Because we're using the padding='same' option, and the default
stride of 1 by 1, the output of each convolution will be the same width 
and height as the input. 


Listing 24.35 shows a function to make this new model. Note that 
we're setting the optional argument kernel_constraint to the value
maxnorm(3), just as we did with the Dense layers earlier. For convolu- 
tion layers it prevents the values in the filters from getting too big, in 
the same way that it prevented the weights in the Dense layers from 
getting too big. 


def make_bigger_cnn_model():
    model = Sequential()
    model.add(Conv2D(16, (5, 5), activation='relu',
                     padding='same',
                     kernel_constraint=maxnorm(3),
                     input_shape=(image_height, image_width, 1)))
    model.add(Dropout(0.2))
    model.add(Conv2D(8, (3, 3), activation='relu', padding='same',
                     kernel_constraint=maxnorm(3)))
    model.add(Dropout(0.2))
    model.add(Conv2D(8, (3, 3), activation='relu', padding='same',
                     kernel_constraint=maxnorm(3)))
    model.add(Dropout(0.2))
    model.add(Flatten())
    model.add(Dense(number_of_classes, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model


Listing 24.35: The code to make the model of Figure 24.29. 


Listing 24.36 shows how we'd call this function to make a new model. 


bigger_cnn_model = make_bigger_cnn_model()
bigger_cnn_history = bigger_cnn_model.fit(X_train, y_train,
                                          validation_data=(X_test, y_test),
                                          epochs=100, batch_size=256)


Listing 24.36: Training the model of Listing 24.35. 


The results are shown in Figure 24.30. 


Figure 24.30: The accuracy and loss from training our model of Figure
24.29 for 100 epochs. 


The numbers from the end of this run are in Listing 24.37. 


Epoch 100/100 
127s-loss: 0.0094-acc: 0.9965-val_loss: 0.0565-val_acc: 0.9879 


Listing 24.37: The numbers from the final line of training our bigger CNN. 


We’ve pretty much eliminated overfitting, and the validation accuracy
is just a bit higher than before (0.9879 versus 0.9862). As far as
validation accuracy goes, it seems we could have stopped after about
40 epochs.


There might be a small amount of overfitting going on, which we can 
try to reduce by adjusting hyperparameters, as we did before. 


On a late-2014 iMac, without GPU support, using the TensorFlow 
backend, this model took a little more than 2 minutes for each epoch. 
That’s about double the time required by our first model with just one 
convolution layer. 


Now that we have our feet wet, we can think about looking for better 
performance. But where do we start? 


Progress on deep-learning architectures often comes from building
on results other people have developed and published. Looking at the 
MNIST page [LeCun13], we can see architectures for convnets that 
have worked well on this data set. Most of these have some advanced 
or experimental features, but we can still emulate their basic structure. 


Let’s try making the image smaller and smaller as it works its way 
through the network. This way each layer gets to work with larger 
pieces of the original image. 


We'll do this first using pooling layers. Though we’ve noted pooling 
layers are falling out of favor in convnets, we'll use them now because 
they let us explicitly show how the tensor gets reduced in size as it 
flows through the network. 


Each pooling layer will look at 2 by 2 non-overlapping boxes, and 
return the largest value in that group of 4 input elements. As a result, 
the output of each of these layers will have half the width and height of 
its input. The input images are 28 by 28, so the output of the first max 
pooling layer will be 14 by 14, and the output of the second will be 7 by 
7. Of course, the depth of each of these tensors is given by the number 
of filters in the preceding convolution layer. 
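The shrinking tensor can be traced with a few lines of arithmetic. This sketch assumes the 30 and 16 filters used in the model built in this section, convolutions that preserve width and height (padding='same', stride 1), and 2 by 2 pooling; the helper pooled_size is hypothetical.

```python
import math

def pooled_size(size, pool=2):
    """Width or height after non-overlapping pooling with padding='same'."""
    return math.ceil(size / pool)

height, width = 28, 28
for number_of_filters in (30, 16):
    # the convolution keeps height and width; the pooling halves them
    height, width = pooled_size(height), pooled_size(width)
    print((height, width, number_of_filters))
# prints (14, 14, 30) and then (7, 7, 16)

flattened_size = height * width * number_of_filters
print(flattened_size)   # 784
```

Compared with the 25,088 values our first model handed to its dense layer, the dense layers here receive far fewer inputs.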


We'll include three dense layers at the end, also of decreasing size. This 
is basically just working on a hunch that because we have fewer inputs 
arriving at the final dense layers (just 7x7=49 values per channel), we could
benefit from more processing of those values. And what the heck, let’s 
include dropout after each convolution layer. 


Figure 24.31 shows a diagram of this architecture. 


Figure 24.31: A CNN with dropout and pooling layers.


Listing 24.38 shows a function to make this new model. 


def make_pooling_cnn_model():
    model = Sequential()
    model.add(Conv2D(30, (5, 5), activation='relu',
                     padding='same',
                     kernel_constraint=maxnorm(3),
                     input_shape=(image_height, image_width, 1)))
    model.add(Dropout(0.2))
    model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))
    model.add(Conv2D(16, (3, 3), activation='relu',
                     padding='same',
                     kernel_constraint=maxnorm(3)))
    model.add(Dropout(0.2))
    model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(number_of_classes, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model


Listing 24.38: Making the model in Figure 24.31. 


Listing 24.39 shows how we'd call this function to make a new model. 


pooling_cnn_model = make_pooling_cnn_model()
pooling_cnn_history = pooling_cnn_model.fit(X_train, y_train,
                                            validation_data=(X_test, y_test),
                                            epochs=100, batch_size=256)


Listing 24.39: Code to train the model of Figure 24.31. 


The results are shown in Figure 24.32. 


Figure 24.32: Results for training the model of Figure 24.31 for 100 epochs.


The numbers from the end of this run are in Listing 24.40. 


Epoch 100/100 
147s - loss: 0.0038 - acc: 0.9988 - val_loss: 0.0304 - val_acc: 0.9939


Listing 24.40: The final line of data from training the model of Figure 
24.31 for 100 epochs. 


We've picked up a small but meaningful increase in validation accu- 
racy, from 0.9879 to 0.9939. As we mentioned earlier, progress at this 
point often proceeds in tiny steps. This is actually pretty large, since 
we've shaved off almost half of the distance to 1.
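The "half of the distance to 1" claim is easy to check with a little arithmetic. The sketch below uses only the two validation accuracies quoted above; the variable names are ours.

```python
old_accuracy = 0.9879
new_accuracy = 0.9939

old_error = 1 - old_accuracy     # about 0.0121
new_error = 1 - new_accuracy     # about 0.0061

# Fraction of the remaining distance to 1 that the new model removed
fraction_removed = (old_error - new_error) / old_error
print(round(fraction_removed, 3))
```

Thinking in terms of the remaining error, rather than the accuracy itself, makes small-looking gains at this stage easier to judge.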


And we seem to have no overfitting. In fact, after about the 50th epoch 
everything has pretty much settled.


We mentioned that pooling layers after convolution are being replaced
these days by striding in the convolution layers themselves. Let’s 
implement this and replace the 2 by 2 max pooling layers with 2 by 
2 striding. The results will be a little different, because the process of 
striding the convolution kernels is not identical to moving them by 
single steps and then pooling, but we’d expect a rough similarity. 
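To make the difference concrete, here is a small one-dimensional sketch in plain Python (not Keras). It compares convolving at every position and then max pooling against convolving with a stride of 2. The helper functions, signal, and kernel are all made up for illustration, and no padding is used.

```python
def convolve_1d(signal, kernel, stride=1):
    """Slide kernel over signal (no padding), moving by stride each time."""
    width = len(kernel)
    return [sum(s * k for s, k in zip(signal[i:i + width], kernel))
            for i in range(0, len(signal) - width + 1, stride)]

def max_pool_1d(values, size=2):
    """Keep the largest value in each non-overlapping group of `size`."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]

signal = [1, 5, 2, 8, 3, 9, 4, 7, 6]
kernel = [1, -1]

pooled = max_pool_1d(convolve_1d(signal, kernel, stride=1))
strided = convolve_1d(signal, kernel, stride=2)
print(pooled)   # convolve at every position, then keep the max of each pair
print(strided)  # convolve at every other position only
```

Both results have the same length, but the values differ: pooling gets to pick the larger of two neighboring responses, while striding never computes the skipped response at all.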


Figure 24.33 shows a diagram of this architecture.


Figure 24.33: Starting from our model of Figure 24.31, we've replaced the
pooling layers with striding in the convolution layers. 


Listing 24.41 shows a function to make this new model. 


def make_striding_cnn_model():
    model = Sequential()
    model.add(Conv2D(30, (5, 5), activation='relu',
                     padding='same', strides=(2, 2),
                     kernel_constraint=maxnorm(3),
                     input_shape=(image_height, image_width, 1)))
    model.add(Dropout(0.2))
    model.add(Conv2D(16, (3, 3), activation='relu',
                     padding='same', strides=(2, 2),
                     kernel_constraint=maxnorm(3)))
    model.add(Dropout(0.2))
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(number_of_classes, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model


Listing 24.41: Code to make the striding CNN of Figure 24.33.


Listing 24.42 shows how we'd call this function to make a new model.


striding_cnn_model = make_striding_cnn_model()

striding_cnn_history = striding_cnn_model.fit(X_train, y_train,
    validation_data=(X_test, y_test),
    epochs=100, batch_size=256)


Listing 24.42: Code to build the striding CNN of Figure 24.33. 


The results are shown in Figure 24.34.


Figure 24.34: Accuracy and loss from training the striding CNN of Figure
24.33 for 100 epochs. 


The numbers from the end of this run are in Listing 24.43. 


Epoch 100/100 
36s - loss: 0.0062 - acc: 0.9978 - val_loss: 0.0400 - val_acc: 0.9912


Listing 24.43: The final lines from training the model of Figure 24.33. 


We've lost a very tiny bit of accuracy in both the training and validation 
sets, but otherwise these numbers and their graph look pretty much 
like what we saw from using explicit pooling layers. 


But what has changed a lot is the timing. As we can see in Listing 24.40, 
the pooling model required about 147 seconds per epoch on a late-2014 
iMac without GPU, while the striding version took only 36 seconds per 
epoch. The striding epochs took only about 25% of the time required by
the epochs that used explicit pooling. Clock times can be misleading,
but it’s hard not to like getting nearly the same performance with only 
25% of the time and effort. 
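For the record, here is the arithmetic behind that 25% figure, using the per-epoch times reported in Listings 24.40 and 24.43. The variable names are ours.

```python
pooling_seconds = 147    # per epoch, from Listing 24.40
striding_seconds = 36    # per epoch, from Listing 24.43

fraction_of_time = striding_seconds / pooling_seconds
speedup = pooling_seconds / striding_seconds
print(round(fraction_of_time, 3), round(speedup, 2))
```

In other words, each striding epoch ran roughly four times faster on this machine. As noted, timings like these depend heavily on the hardware and library versions.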


Can we increase performance even more? 


We can play with any aspect of our network. We can add or remove 
filters at each layer, change their size, increase the dropout percent- 
age, add more convolution layers, and so on. Just for variety, let’s try 
replacing the dropout layers with batchnorm layers. Both are designed 
to reduce overfitting, so we can see which of the two techniques works 
best for this network and this data. 


Figure 24.35 shows a diagram of this architecture.


Figure 24.35: Our model of Figure 24.33, where we've replaced each
dropout layer with a batchnorm layer. 


Listing 24.44 shows a function to make this new model. We're setting
the activation parameter in the convolution layers to None because,
as we saw in Chapter 20, batchnorm operates between a layer's output
and its activation function. So we follow each convolution layer with
a BatchNormalization layer, and then a separate Activation layer that
applies the ReLU function.


def make_striding_batchnorm_cnn_model():
    model = Sequential()
    model.add(Conv2D(30, (5, 5), activation=None,
                     padding='same', strides=(2, 2),
                     input_shape=(image_height, image_width, 1)))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(Conv2D(16, (3, 3), activation=None,
                     padding='same', strides=(2, 2)))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(number_of_classes, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model


Listing 24.44: Code to make the striding-batchnorm CNN of Figure
24.35. Note that we place the BatchNormalization layer between the
convolution layer and its relu activation function, placed on its own
layer.
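To illustrate the ordering (normalize first, then activate), here is a minimal sketch in plain Python. It is a simplification, not the Keras implementation: it normalizes one batch of made-up values to zero mean and unit variance and leaves out the learned scale and offset that a real batchnorm layer also applies.

```python
import math

def batch_normalize(values, epsilon=1e-5):
    """Shift and scale a batch of values to zero mean and unit variance.
    Simplification: the learned scale and offset are left out."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(variance + epsilon) for v in values]

def relu(values):
    """Replace every negative value with 0."""
    return [max(0.0, v) for v in values]

# Hypothetical outputs of one filter across a batch, before any activation
layer_outputs = [2.0, 4.0, 6.0, 8.0]

normalized = batch_normalize(layer_outputs)   # step 1: normalize
activated = relu(normalized)                  # step 2: apply ReLU
print(normalized)
print(activated)
```

Had we applied ReLU first, all four of these positive inputs would have passed through unchanged. Normalizing first centers them around 0, so the activation function then zeroes out the lower half.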


Listing 24.45 shows how we’d call this function to make a new model. 


striding_batchnorm_cnn_model = \
    make_striding_batchnorm_cnn_model()
striding_batchnorm_cnn_history = \
    striding_batchnorm_cnn_model.fit(
        X_train, y_train,
        validation_data=(X_test, y_test),
        epochs=100, batch_size=256)


Listing 24.45: Code to build the striding-batchnorm CNN of Figure 24.35. 


The results are shown in Figure 24.36.


Figure 24.36: Accuracy and loss from training the striding-batchnorm
CNN of Figure 24.35 for 100 epochs. 


The numbers from the end of this run are in Listing 24.46. 


Epoch 100/100 
45s - loss: 2.6886e-04 - acc: 1.0000 - val_loss: 0.0618 - val_acc: 0.9911


Listing 24.46: The final lines from training the model of Figure 24.35. 


The validation accuracy is about the same as we got before, but we 
seem to have picked up a small amount of overfitting. The epochs also 
take about 25% longer to run. 


The noise in the curves is a matter of some concern. We’d want to be 
careful when we stop training, to make sure we're not in one of those 
peaks in the validation loss (or the corresponding valleys in the valida- 
tion accuracy). This is one of those times when it makes a lot of sense 
to keep multiple checkpoints, and then choose one based on looking at 
the performance graphs. 
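If we record the validation loss each time we save a checkpoint, we can also make this choice programmatically. The sketch below is only an illustration of the idea: the epochs and loss values are made up, and it stands in for, rather than replaces, looking at the graphs.

```python
# Hypothetical validation losses recorded at the epochs where we saved
# checkpoints (these numbers are made up for illustration)
checkpoint_val_loss = {
    80: 0.0652,
    85: 0.0597,
    90: 0.0711,   # a peak in the noisy curve
    95: 0.0583,
    100: 0.0618,
}

# Pick the saved epoch whose validation loss is lowest
best_epoch = min(checkpoint_val_loss, key=checkpoint_val_loss.get)
print(best_epoch, checkpoint_val_loss[best_epoch])
```

Note that the final epoch is not necessarily the best one when the curves are noisy.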


Overall, this variation doesn’t seem to have given us anything better 
than before. This is the value of experimenting: until we try, we can’t 
be sure how a network and a particular data set will behave.


24.4.5 Patterns


Historically, many CNNs were assembled by repeating a few types 
of recognizable blocks of layers [Karpathy16a]. Such a block is a set 
of convolution layers, followed by a pooling layer. This block is then 
repeated several times, perhaps with different parameters to the con- 
volution layers. After that comes a series of fully-connected layers. 
Figure 24.37 shows an example of such an architecture.


Figure 24.37: CNNs are often made up of repeating patterns. One popular
pattern is a few convolution layers followed by a pooling layer, which we 
see here repeated three times. Typically, the parameters of the convolu- 
tions change from one block (in yellow) to the next. 


The pooling layers usually tile the input with 2 by 2 blocks. That is, the 
receptive field is 2 by 2, and we use a stride of (2,2) so that we produce 
a tiling with no overlaps or holes. This makes an output that has half 
the width and height of the input. 
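Here is a minimal sketch of that tiling in plain Python, applied to a made-up 4 by 4 grid. The function max_pool_2x2 is ours, written for clarity rather than speed, and it assumes the grid has an even width and height.

```python
def max_pool_2x2(grid):
    """Max pooling with a 2 by 2 receptive field and a stride of (2, 2)."""
    return [[max(grid[r][c], grid[r][c + 1],
                 grid[r + 1][c], grid[r + 1][c + 1])
             for c in range(0, len(grid[0]) - 1, 2)]
            for r in range(0, len(grid) - 1, 2)]

grid = [[1, 3, 2, 0],
        [4, 2, 1, 5],
        [0, 1, 7, 2],
        [6, 2, 3, 3]]

pooled = max_pool_2x2(grid)
print(pooled)
```

Every input element belongs to exactly one 2 by 2 block, so the 4 by 4 input becomes a 2 by 2 output.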


This network is a simplified version of the VGG16 network we've seen 
before, drawn here to demonstrate the idea of repeated units, but just 
for fun, let's run this on the MNIST data. The results are shown in
Figure 24.38.


Figure 24.38: The architecture of Figure 24.37 evaluating MNIST data.


The final lines from training are in Listing 24.47. 


Epoch 100/100 
339s - loss: 0.0119 - acc: 0.9974 - val_loss: 0.0760 - val_acc: 0.9902


Listing 24.47: The final lines from training the model of Figure 24.37. 


This isn’t the best data we’ve seen, and there’s some overfitting, but it’s 
not bad for an essentially arbitrary network. It does take quite a while 
to run each epoch, as we might expect from all those layers. 


As we did before, let’s replace the pooling layers with striding in the 
final convolution layer of each set, as in Figure 24.39. We usually leave 
the stride of the other layers at 1. This is becoming a more attractive 
option as omitting the pooling layers gives us a smaller and faster 
network, and when things are well-tuned there seems to be no loss 
of performance [Karpathy16a] [Springenberg15]. The stride sizes are 
often (2,2), as they were for the pooling layers we're replacing.


Figure 24.39: Replacing the pooling layers of Figure 24.37 with striding
in the convolution layers. 


The results are shown in Figure 24.40.


Figure 24.40: The architecture of Figure 24.39 evaluating MNIST data.


The final lines from training are in Listing 24.48. 


Epoch 100/100 
186s - loss: 2.8056e-04 - acc: 1.0000 - val_loss: 0.0815 - val_acc: 0.9873


Listing 24.48: The final lines from training the model of Figure 24.39.


The results are just a little noisier than before, but they land in roughly
the same place. We even cut the time per epoch roughly in half, sug- 
gesting that striding in the convolution layer is significantly faster than 
subsequent pooling. This makes sense, since convolution with striding 
means we apply the convolution filter less frequently, and we can skip 
the subsequent post-processing step of pooling altogether.


24.4.6 Image Data Augmentation 


One of the best ways to improve performance of any model is to give it 
as much training data as we can, while avoiding overfitting. 


When we're working with images, we can easily create lots of new data 
by simply manipulating the images we already have, creating a wide 
variety of variations on the original. We could move each image left, 
right, up, or down, make it a little smaller or larger, rotate it clockwise 
or counter-clockwise by some amount, or perhaps flip it horizontally 
or vertically. Figure 24.41 shows some of these variations on an image 
of a Eurasian Eagle Owl. We've deliberately used extreme transformations
to make their effects easier to see.


Figure 24.41: Augmenting an image of a Eurasian Eagle Owl with rotations,
flips, and scaling. The original is in the upper left. These transformations
are deliberately extreme to show their effect.


The process of enlarging a dataset by creating variations is called data 
amplification, or data augmentation. 


When we're working specifically with images, Keras provides 
a built-in object to perform data augmentation. It’s called the 
ImageDataGenerator, and it performs all of the modifications we just 
mentioned, and a few others besides. 


As its name suggests, this object is a “generator,” which is a specific 
kind of object in the Python language [PythonWiki17]. In a nutshell, 
a generator can be thought of as a function that runs an internal loop, 
typically carrying out calculations and producing data. When that loop 
reaches a yield statement, the generator returns control to the routine
that called it, with the argument to yield set to its value, just like a
return statement. But if we call that function again, the loop picks up
where it most recently stopped, and continues as though it had never 
been interrupted. 
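A tiny example may help. The generator below is our own illustration and has nothing to do with images. In Python, we ask a generator for its next value with the built-in next() function.

```python
def count_by_twos(limit):
    """A generator: each yield hands a value back to the caller, and the
    loop resumes right after the yield the next time a value is requested."""
    value = 0
    while value < limit:
        yield value
        value += 2

counter = count_by_twos(6)   # creates the generator; no loop code runs yet
first = next(counter)        # runs until the first yield
second = next(counter)       # resumes inside the loop where it stopped
rest = list(counter)         # drains whatever values remain
print(first, second, rest)
```

The local variable value keeps its contents between requests, which is what lets the loop continue as though it had never been interrupted.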


The ImageDataGenerator is set up this way because we can config- 
ure it to produce large numbers of variations on each of our input 
images. This can take a lot of time and a lot of computer memory. So 
rather than compute all the variations ahead of time and hold on to 
them until they’re needed, the generator creates batches of images on 
demand. Each time we call the generator, it will produce and return 
another batch of images. A variation on the fit() method, which we'll 
see below, uses the generator as the source of training data, rather 
than tensors that we pass in. The routine calls the generator over and 
over, each time it needs more data. 


The transformations we applied to the images in Figure 24.41 were 
deliberately exaggerated to show their effects. In practice, we want to 
make new data that is close enough to the input to be plausible. After 
all, there’s no reason to learn from distorted inputs that are not rep- 
resentative of the data we expect to see. In fact, that could hurt our 
ultimate performance, since some of the network’s power would be 
uselessly directed to processing those inputs. 


If we want to use our generated data again later, we can save time by 
telling the generator to read and write its images in a given directory. 
Then each time we ask for another batch of images, it reads and returns 
the saved, transformed files if they’re available, or else generates them, 
saves them, and then returns them. This feature is also useful when 
we want to look at the generated files, to make sure we’ve selected the 
right options to get the sort of variations we're after. If memory is pre- 
cious and we're not pressed for time, we can just always skip the disk 
files altogether and make the transformed images fresh, on demand. 


The ImageDataGenerator is a workhorse, capable of applying all sorts 
of transformations to our images. We only need to list the
image-transforming operations we want when we build the object, and it
will apply them all. We'll demonstrate the process with just a few
transformations. The Keras documentation provides the complete list of all
available options. 


The overall process of setting up and using the generator takes 
only two steps. First, we create our ImageDataGenerator with 
the options we want. Second, we train our model. But instead of 
using fit() to start training, we use fit_generator(). These both 
take the same arguments with one exception: The first argument to 
fit_generator() is a function that returns batches of samples. 


The usual function that we provide to fit_generator() is a function 
called flow(), which is automatically created for us as part of our 
ImageDataGenerator object. The metaphor is that the generator is 
producing a flow of data on demand, like water flowing out of a faucet 
when we turn the handle. Calling flow() provides a burst of training 
samples for our use. 


Together, fit_generator() and flow() manage the production of 
batches of images, and presenting them to our model for training. 
Listing 24.49 shows a typical use of ImageDataGenerator. 


# create the image generator with rotations and flips
image_generator = ImageDataGenerator(
    rotation_range=100,
    horizontal_flip=True)
# fit our model using images produced by the image generator
model.fit_generator(image_generator.flow(
        X_train, Y_train, batch_size=256, seed=42),
    epochs=100,
    samples_per_epoch=len(X_train))


Listing 24.49: Using an ImageDataGenerator to produce trans- 
formed images. Each image might be randomly flipped horizontally, and/ 
or rotated up to 100° in either direction. 


A few images produced by an ImageDataGenerator using just a single 
sample from the MNIST training set are shown in Figure 24.42.


Figure 24.42: Using an ImageDataGenerator to produce variations
on a single MNIST sample for the number 2. For demonstration purposes
we've allowed horizontal flipping, and a large range for rotation. In prac- 
tice, for this data, we’d probably use much less rotation, and we wouldn't 
allow flips, since a mirror-image 2 isn’t really a 2 at all. Some noise is 
visible around the edges where the low-resolution original image has 
been resampled. 


Normally, each run of this code will produce different results. For 
testing and debugging, it’s often useful to get back the same sequence 
every time. We can force this by setting the seed argument when we 
call flow(), as we did above. This has the same purpose as setting the 
seed for a random number generator, which sets it up to always pro- 
duce the same sequence of pseudo-random values. 
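The effect of a seed can be seen with Python's own random number generator. This sketch is an analogy, not the Keras mechanism: the function random_rotations is hypothetical and simply picks a few rotation angles.

```python
import random

def random_rotations(count, max_degrees, seed=None):
    """Return `count` random rotation angles in [-max_degrees, max_degrees]."""
    rng = random.Random(seed)   # a private random number generator
    return [rng.uniform(-max_degrees, max_degrees) for _ in range(count)]

run_a = random_rotations(3, 100, seed=42)
run_b = random_rotations(3, 100, seed=42)   # same seed: same angles
run_c = random_rotations(3, 100, seed=7)    # different seed: different angles
print(run_a == run_b, run_a == run_c)
```

With the same seed we get the same "random" choices on every run, which makes a misbehaving experiment much easier to reproduce.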


In Listing 24.49 we also specified how many samples we want per
batch (the batch_size argument to flow()), and how many samples make
up an epoch and how many epochs we want (arguments to fit_generator()).


It might seem odd to have to specify the number of samples per epoch, 
since until now the library has been able to infer that from the size of 
the input tensor. But the generator will just keep cranking out varia- 
tions as long as we keep calling it, so there’s really no sense of having 
run through “all the data,” which is what we normally call an epoch. 
Yet epochs are important. For example, it’s at the end of an epoch when 
statistics get collected and our callbacks are invoked. So we tell
fit_generator() how many images to use before it simply declares that an epoch
has passed. The value of samples_per_epoch has to be a multiple of 
batch_size or we'll get an error. 
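One way to satisfy that requirement is to round the number of samples down to the nearest multiple of the batch size. The sketch below shows the arithmetic. The helper largest_multiple is ours, and the figure of 60,000 samples is an assumption (the size of the standard MNIST training set), so substitute the size of your own training set.

```python
def largest_multiple(sample_count, batch_size):
    """Largest multiple of batch_size that does not exceed sample_count."""
    return (sample_count // batch_size) * batch_size

sample_count = 60000   # assumed: the size of the standard MNIST training set
batch_size = 256

leftover = sample_count % batch_size
samples_per_epoch = largest_multiple(sample_count, batch_size)
print(leftover, samples_per_epoch, samples_per_epoch // batch_size)
```

A nonzero leftover means the sample count is not a multiple of the batch size, so the rounded-down value is the one to pass along.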


24.4.7 Synthetic Data 


This section’s notebook is 
Keras-Notebook-12-CNN-Synthetic-Data.ipynb 


We usually build networks because we want to deploy them for real- 
world use, so we train them on real-world data. But testing and 
practice datasets are useful for helping us experiment with architec- 
tures, pre-processing strategies, and hyperparameters. 


A great way to create an environment where we control everything is 
to train on our own data, which we generate on demand. Then we can 
make the data we want, rather than search for something out there 
that comes close. 


We use the phrase synthetic data to describe data that we create our- 
selves, usually on the fly with an algorithm. We saw synthetic data in 
Chapter 15, when we used scikit-learn’s built-in algorithms for making 
half-moons and blobs. 


The great thing about generating synthetic data is that we can make as 
much of it as we want or need, and then train with it as usual. 


It’s conceptually easy to do this. We just hook our data-generating pro- 
gram into a variation of the ImageDataGenerator object. 


The trick here is to modify the flow() routine inside our 
ImageDataGenerator object. Normally, flow() pulls the next sample 
out of the training set, and then applies our requested transformations 
to it. We can modify that step so that instead of pulling a sample from 
the training set, it calls a routine to create a brand-new sample
and its label. Then that new sample gets transformed and returned, as
usual.


Although this is easily said, the mechanics get a little complicated,
requiring some fancy footwork in Python, and the documentation to 
guide us is scant. Source can be found in the notebook for this chapter, 
which is adapted from an online example [Xie16]. 
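Setting the Keras machinery aside, the core idea can be sketched with an ordinary Python generator. This is not the modified ImageDataGenerator from the notebook. The function make_sample is a hypothetical stand-in for a drawing routine, it draws only two simple shapes, and the transformation step is omitted.

```python
import random

def make_sample(rng, size=8):
    """Hypothetical stand-in for a drawing routine: returns (image, label).
    Label 0 is a vertical line and label 1 is a horizontal line, each
    placed at a randomly jittered position."""
    label = rng.randrange(2)
    position = size // 2 + rng.choice([-1, 0, 1])
    image = [[0] * size for _ in range(size)]
    for i in range(size):
        if label == 0:
            image[i][position] = 1   # vertical line in column `position`
        else:
            image[position][i] = 1   # horizontal line in row `position`
    return image, label

def synthetic_batches(batch_size, seed=None):
    """Endless generator of (images, labels) batches, made on demand."""
    rng = random.Random(seed)
    while True:
        samples = [make_sample(rng) for _ in range(batch_size)]
        images = [image for image, _ in samples]
        labels = [label for _, label in samples]
        yield images, labels

batches = synthetic_batches(batch_size=4, seed=42)
images, labels = next(batches)
print(len(images), labels)
```

Because the generator never runs out, the training routine has to be told how many samples make up an epoch, just as with augmented images.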


To demonstrate the idea, we’ve written a little routine that draws an 
image in a 64 by 64 square. There are five types of images: a Z shape, a 
plus sign, three vertical lines, a squared U, and a circle. Each time we 
draw one of these shapes we wiggle the points around a little, so that 
no two shapes are the same. The function returns the image it drew 
and the label. The label is a number from 0 to 4 that identifies which
type of shape is in the image. 


Figure 24.43 shows a random collection of these images. Note that 
these variations are performed on the points that make up the image, 
and are inherently different than what we get by applying the types 
of deformations (scaling, rotating, etc.) that ImageDataGenerator can 
apply once it’s got the image. 




Figure 24.43: Synthetic images produced by a little routine. Each of the 
5 types of images uses points (and in the case of the circle, a radius) that 
have been randomly perturbed from their starting positions.


We used a version of ImageDataGenerator modified as above to train
the simple convnet shown in Figure 24.44. 


[Figure: a convolution layer with 32 filters of size 5 by 5, ReLU activation, and 'same' padding.]


Figure 24.44: A simple convnet for classifying our synthetic data. 


The results are shown in Figure 24.45.


Figure 24.45: Accuracy and loss for our simple model on our synthetic
data, generated on the fly. 


Despite being about as tiny as a classifier convnet can be, the net- 
work manages impressive accuracy on the test data, without obvious 
overfitting. 


24.4.8 Parameter Searching for Convnets 


Deep convnets can take a long time to train, particularly if we use big 
data sets. If we use the GridSearchCV or RandomizedSearchCV objects
we used earlier to search for hyperparameters, we might not have 
enough compute power to produce results in a reasonable time. 


Chapter 24: Keras Part 2 


There are faster alternatives. Unfortunately, they take some work to 
set up and use, so we won’t go into them here. A good place to start for 
automatic parameter searching is the Spearmint project [Snoek16a] 
[Snoek16b]. 


24.5 RNNs 


This section’s notebook is 
Keras-Notebook-13-RNN.ipynb 


As we saw in Chapter 22, recurrent neural networks, or RNNs, are 
great for sequential data. The MNIST image data we’ve been using 
is not sequential, because there’s no order to the images. 


Sequential data, on the other hand, is inherently ordered. Classic 
examples are daily temperatures, the daily price of a stock, and the 
hourly height of a tide. There’s also data that’s ordered, but not in time, 
such as children lined up by height, shelved library books, and the col- 
ors of the rainbow. 


In all of these phenomena, we want to use the information in the 
sequence of inputs to help us produce new output. 


In RNN terminology, we still have a dataset made of samples, where 
each sample contains multiple features. But now each feature contains 
multiple values, called time steps. Recall that we can also think of 
“time steps” as “series of measurements for a given feature.” Our exam- 
ple in Chapter 22 imagined a weather station on top of a mountain, 
taking multiple measurements every hour during 8 daytime hours. 
Each day’s results make up a sample, and each type of measurement 
(such as temperature and wind speed) is a feature. Each feature contains
8 time steps, with one value for each hour.


Throughout this chapter we've been classifying the MNIST data. But
the MNIST data has no sequential qualities, and our goal now is not 
classification, but prediction of the next value in a sequence. So in the 
next section we'll generate some new data of our own that we can use 
to show how to set up and run RNNs.


24.5.1 Generating Sequence Data 


There are lots of sequential datasets available, but some of them are 
complicated or hard to draw. So let’s make our own simple dataset 
that we can easily draw and interpret. 


We'll just add up a bunch of sine waves, such as that at the top of Figure 
24.46. Each sine wave has a frequency, an amplitude, and a phase 
(or offset). We'll write a routine that takes lists of these values, called 
freqs, amps, and phases respectively, and uses them to add up all the 
waves at many points. Figure 24.46 shows the idea. 


Figure 24.46: Adding up sine waves. Top left: A single sine wave. Middle
left: Our original wave, and a wave with smaller amplitude. Middle center:
Our original wave, and a wave with a different phase. Middle right: Our
original wave, and a wave with a different frequency. Bottom left: All four
waves superimposed. Bottom center: All four waves added together.


Our little curve-building routine takes three other arguments. The first
is an integer called number_of_steps, which tells it how many points
to generate. The second is a float called d_theta that tells the routine
the spacing of the samples (the name comes from thinking of the sine
wave as based on an angle, which is often written with the lower-case
Greek letter θ, or theta). Finally, skip_steps is an integer that provides
an offset to the starting point, so we don't always begin at 0. This is
useful for creating the test data, which can start far to the right of the 
training data. 


The routine sum_of_sines() is shown in Listing 24.50. We wrote it to 
emphasize clarity. Since this is plain Python programming, and noth- 
ing specific to machine learning, we won’t go into the details. 


import math
import numpy as np

def sum_of_sines(number_of_steps, d_theta, skip_steps,
                 freqs, amps, phases):
    '''Add together multiple sine waves and return
    a list of values that is number_of_steps long.
    d_theta is the step (in radians) between samples.
    skip_steps determines the start of the sequence.
    The lists freqs, amps, and phases should all be
    the same length (but we don't check!)'''
    values = []
    for step_num in range(number_of_steps):
        angle = d_theta * (step_num + skip_steps)
        total = 0   # running sum of all the waves at this angle
        for wave in range(len(freqs)):
            y = amps[wave] * math.sin(
                freqs[wave] * (phases[wave] + angle))
            total += y
        values.append(total)
    return np.array(values)


Listing 24.50: A little routine to create a list that holds the sum of multiple, 
different sine waves.


We'll use this routine to generate two different data sets, which we'll
call data set 0 and data set 1. The values we used to construct them
were found by trial and error until they produced one graph that felt 
“calm,” so it wouldn’t be too hard to predict, and one that felt “busy,” 
for a harder challenge. 


We'll consider the data we use for evaluation to be a test set, which is 
used just once, rather than a validation set, which can be used multi- 
ple times when evaluating different forms of the network. 


Data set 0 is a gentle sum of two waves. We made one wave twice the
speed of the other by setting freqs to (1,2), the second wave twice
as high as the first by setting amps to (1,2), and started both waves
at 0 by setting phases to (0,0). Our training data came from using
200 steps (number_of_steps = 200), a step of about 0.057 radians 
(d_theta = 0.057), and no offset (skip_steps = 0). We chose the 
weird step size by eye so that 200 samples produced what we felt was a 
good amount of data for an easy test case. 
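To get a feel for how much of each wave these settings cover, we can work out the span of the samples in radians and divide by the length of one cycle. This sketch uses only the values given above and assumes one full cycle of a wave with frequency f takes 2π/f radians.

```python
import math

number_of_steps = 200
d_theta = 0.057

span = number_of_steps * d_theta          # radians covered by the samples
slow_cycles = span * 1 / (2 * math.pi)    # cycles of the wave with freq 1
fast_cycles = span * 2 / (2 * math.pi)    # cycles of the wave with freq 2
print(round(span, 2), round(slow_cycles, 2), round(fast_cycles, 2))
```

So the training data holds a little under two cycles of the slower wave, and about twice that many of the faster one.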


The training set is 200 samples long, starting at 0. The test set is 
another 200 steps, starting far to the right of the training set. Listing 
24.51 shows the calls to make this data set. 


train_sequence_1 = sum_of_sines(
    200, 0.057, 0, [1, 2], [1, 2], [0, 0])


test_sequence_1 = sum_of_sines(
    200, 0.057, 400, [1, 2], [1, 2], [0, 0])


Listing 24.51: Creating data set 0.


The resulting training and test data are shown in Figure 24.47.


Figure 24.47: The training and test data for our first set of sine waves.


Data set 1 is a harder challenge that uses 4 waves. For this set, we set
freqs to (1.1, 1.7, 3.1, 7) and amps to (1, 2, 2, 3), and we again
left all the phases at 0, so phases is (0,0,0,0). The weird frequencies
are chosen so that the pattern won't repeat for tens of thousands
of samples. The other variables are the same as for data set 0. Listing
24.52 shows the calls to make the data.


train_sequence_2 = sum_of_sines(200, 0.057, 0,
                                [1.1, 1.7, 3.1, 7],
                                [1, 2, 2, 3], [0, 0, 0, 0])
test_sequence_2 = sum_of_sines(200, 0.057, 400,
                               [1.1, 1.7, 3.1, 7],
                               [1, 2, 2, 3], [0, 0, 0, 0])


Listing 24.52: Creating data set 1. 


The results are shown in Figure 24.48.


Figure 24.48: The training and test data for our second, busier set of sine
waves. 


24.5.2 RNN Data Preparation 


The mechanics for preparing data for RNNs in Keras are a little more 
complicated than what we’ve been working with so far, because we 
have to carry out a couple of reshaping steps in order to use all the 
library routines we want. We also need to extract our little windowed 
sublists, which we have to do ourselves since there aren’t any library 
routines to do it for us. 


Let’s dig into the steps and knock them down one by one in order. 


As before, we want to normalize our data to get it into the range [0,1]. 
The MinMaxScaler from scikit-learn is the perfect tool for the job. But 
recall from Chapter 15 that this routine expects our features to be 
arranged vertically, as in Figure 24.49.


              Temperature   Rainfall   Wind Speed   Humidity
June 3             60          0.2          4          0.1
June 6             75          0            8          0.05
June 9             70          0.1         12          0.2

range           [60, 75]    [0, 0.2]     [4, 12]   [0.05, 0.2]

new June 3         0           1            0          0.33
new June 6         1           0            0.5        0
new June 9         0.66        0.5          1          1


Figure 24.49: The MinMaxScaler, like most feature-wise normalizers, 
reads all the values for each feature, finds the minimum and maximum, 
and re-scales the data to the range [0,1]. Each feature (a column in this 
example) is scaled independently (this is a variant of Figure 12.37). 
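The arithmetic behind each column is simple enough to write out by hand. The function below is our own stand-in for what MinMaxScaler does to a single feature, not the library's code, and it assumes the column contains at least two different values. The data comes from three of the columns in Figure 24.49.

```python
def min_max_scale(column):
    """Rescale a list of numbers so the smallest maps to 0 and the largest
    to 1. A hand-rolled stand-in for what MinMaxScaler does per feature."""
    low, high = min(column), max(column)
    return [(value - low) / (high - low) for value in column]

# One list per feature (that is, per column of the table)
rainfall = [0.2, 0, 0.1]
wind_speed = [4, 8, 12]
humidity = [0.1, 0.05, 0.2]

print([round(v, 2) for v in min_max_scale(rainfall)])
print([round(v, 2) for v in min_max_scale(wind_speed)])
print([round(v, 2) for v in min_max_scale(humidity)])
```

Each column uses its own minimum and maximum, so a large-valued feature like wind speed and a small-valued one like humidity both end up spanning the same range.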


Our sine wave data has only one feature, with many time steps, and 
it’s a 1D list (that is, it’s not a column as MinMaxScaler is expecting). 
So let’s reshape our data into a column. In Python, that means making 
a 2D grid that is as tall as our collection of measurements, and just one 
element wide, as in Figure 24.50. 




[Figure: (a) the starting sequence data as a 1D array of 100 elements; (b) the data prepared as a 2D array.]


Figure 24.50: Reshaping our list of sine wave values as a 2D grid with one 
column. 


We can use reshape() to make this, as shown in Listing 24.53,
where train_sequence and test_sequence can be the corresponding
variables from either data set 0 or 1.


train_sequence = np.reshape(train_sequence,
                            (train_sequence.shape[0], 1))

test_sequence = np.reshape(test_sequence,
                           (test_sequence.shape[0], 1))


Listing 24.53: Our input data is contained in two 1D lists called
train_sequence and test_sequence. To prepare them for MinMaxScaler
we turn each list into a column of elements.


Now that we have our data in the right format to give to MinMaxScaler, 
we'll make an instance of that object and then call its fit() routine on 
the training data. It will find the minimum and maximum values, and 
remember them. Then, as usual, we apply the transformation to both 
the training and test data by calling the scaler’s transform() method, 
as in Listing 24.54. To keep things clear, we'll give the results new 
names, prefixed with scaled_.


from sklearn.preprocessing import MinMaxScaler


min_max_scaler = MinMaxScaler(feature_range=(0, 1))
min_max_scaler.fit(train_sequence)

scaled_train_sequence = min_max_scaler.transform(train_sequence)
scaled_test_sequence = min_max_scaler.transform(test_sequence)


Listing 24.54: We make our transformation from the training data, and 
then apply it to both the training and test data, which we save in new 
variables. 


Note that, as usual, we first fit the scaling object to the training data, 
and then applied that transformation to the test (or validation) data. 
Our object min_max_scaler remembers its transformation, so we'll 
later be able to apply its inverse to the output of our network, giving us 
a result in the same range as the input. 
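
To make the arithmetic concrete, here is a rough NumPy sketch of what a min-max scaler does for a single feature and the range [0,1]. The helper names (fit_min_max and so on) and the data are invented for this illustration. This is not scikit-learn's implementation, and it assumes the training values aren't all identical.

```python
import numpy as np

def fit_min_max(column):
    '''Remember the smallest and largest training values.'''
    return column.min(), column.max()

def transform(column, lo, hi):
    '''Scale so that lo maps to 0 and hi maps to 1.'''
    return (column - lo) / (hi - lo)

def inverse_transform(scaled, lo, hi):
    '''Undo transform(), returning values in the original range.'''
    return scaled * (hi - lo) + lo

train = np.array([[-6.0], [0.0], [6.0]])   # made-up training column
test = np.array([[3.0], [9.0]])            # 9 lies outside the training range

lo, hi = fit_min_max(train)                # fit on the training data only
scaled_train = transform(train, lo, hi)
scaled_test = transform(test, lo, hi)
restored = inverse_transform(scaled_test, lo, hi)
print(scaled_train[:, 0], scaled_test[:, 0], restored[:, 0])
```

Because the minimum and maximum come only from the training data, a test value beyond the training range lands outside [0,1]. That's expected, and the inverse still recovers the original value.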


Now that our data is normalized, it’s time to create the little windowed 
sublists that make up our training and test data. Figure 24.51 shows 
the idea. 


[Figure 24.51 diagram: an input sequence with overlapping windows drawn beneath it, labeled window 0 through window 6.]


Figure 24.51: Chopping up an input sequence into a series of sub-lists. 
Each sub-list has the same length, called the window length. In this 
example, the windows are overlapping, with each one starting just one 
element to the right of the previous window. 

Once we have the windows, we can split them into the sample, which 
will be everything but the final value, and the target, which is that final 
value, as in Figure 24.52. 


[Figure 24.52 diagram: a sub-list divided into its sample and its target.]


Figure 24.52: We break up each windowed list into all the elements but 
the last, which make up our sample, and the last element, which is the 
target. 


Building these windows is a common operation when working with 
RNNs. There are probably a million ways to write a routine to do this 
job. We'll present a little version that emphasizes clarity. It moves 
the starting point of a window of the given size from the start of the 
sequence to a bit before the end, stopping at the last position where 
the whole window still fits inside the input sequence. Because this is 
straight Python programming, and doesn’t have anything to do with 
machine learning, we won’t go deeper into the details. Listing 24.55 
shows our routine. 




import numpy as np

def samples_and_targets_from_sequence(sequence, window_size):
    '''Return lists of samples and targets built from overlapping
    windows of the given size. Windows start at the beginning of
    the input sequence and move right by 1 element.'''
    samples = []
    targets = []
    # i is starting position
    for i in range(sequence.shape[0]-window_size):
        # sub-list of elements
        sample = sequence[i:i+window_size]
        # element following sample
        target = sequence[i+window_size]
        # append sample to list
        samples.append(sample)
        # append target to list
        targets.append(target[0])
    # return as Numpy arrays
    return (np.array(samples), np.array(targets))


Listing 24.55: A Python routine to convert a list of values and a window 
size into two new lists. The first contains multiple overlapping 
sub-sequences from the original list. Each sub-sequence contains 
window_size values. The second list contains the value that follows each 
sub-sequence in the original sequence, which will be our target when 
training and testing. 
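
As one of those million other ways, here is a sketch of a vectorized variant. It assumes NumPy 1.20 or later, which provides sliding_window_view(), and the function name windowed_samples_and_targets is our own invention. Each window it builds holds the sample values plus the target that follows them, matching the split in Figure 24.52.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def windowed_samples_and_targets(sequence, window_size):
    '''Vectorized variant (hypothetical name). Expects a column of
    shape (length, 1), like the scaled sequences above.'''
    flat = sequence[:, 0]
    # Each row holds window_size sample values plus the target that follows.
    windows = sliding_window_view(flat, window_size + 1)
    samples = windows[:, :-1, np.newaxis]   # (num_windows, window_size, 1)
    targets = windows[:, -1]                # (num_windows,)
    return (samples.copy(), targets.copy())

sequence = np.arange(10.0).reshape(-1, 1)   # made-up data: 0, 1, ..., 9
X, y = windowed_samples_and_targets(sequence, 3)
print(X.shape, y.shape)
print(X[0, :, 0], y[0])
```

With 10 values and a window size of 3 we get 7 samples. The first sample holds the values 0, 1, and 2, and its target is 3.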


We can now create our training and test data just by handing our scaled 
sequences to this routine. Listing 24.56 shows the code. As before, 
we'll assign the windowed training data to X_train and y_train and 
the windowed testing data to X_test and y_test. We'll assume that 
the integer variable window_size has been set. 


(X_train, y_train) = samples_and_targets_from_sequence(
    scaled_train_sequence, window_size)

(X_test, y_test) = samples_and_targets_from_sequence(
    scaled_test_sequence, window_size)


Listing 24.56: How to create our training and test data using the utility 
function in Listing 24.55. 

Now that we have our data, we have to make sure it has the necessary 
shape. 


Getting the shape of the data into the right form for the network is 
as essential for RNNs as it is for all the other types of networks we’ve 
seen. If we organize the data in a way that doesn’t match what the net- 
work is expecting, we'll typically either get an error, or our network 
will go haywire and produce crazy results. 


The good news is that we've already accomplished that mission, 
because we wrote the routine samples_and_targets_from_sequence 
to return its data in the shapes that we need for RNN training in Keras. 


Let’s look at those shapes, so it’s clear how the data is structured. 


The easy part is the targets, which we're saving in y_train and y_test. 
These are just 1D lists. 


The training and test data that we’re saving in X_train and X_test 
are 3D blocks. The X_train block is as deep as the number of windows 
that we were able to make (that is, the number of samples), and as tall 
as the window size itself (that is, the number of time steps). The block 
is as wide as the number of features we’re learning. Since we have only 
1 feature in this dataset, the block is only 1 element wide. 


This is the structure that we want for training RNNs in Keras. The 
depth of the block tells us the number of samples we’ll train with. The 
time steps are arranged vertically, and the features horizontally. Figure 
24.53 breaks down this shaping for an imaginary data set with 3 sam- 
ples, each made of 2 features, each with 7 time steps. 

Figure 24.53: The structure of data prepared for RNN training. We have 
3 samples, each containing 2 features, which in turn hold 7 time steps. (a) 
The X_train data set ready for learning by an RNN. (b) The first sample 
from X_train is the slice of the block that is closest to us. Here it’s the 
elements labeled A through G. This can be represented as a 2D grid. (c) 
The first feature in this sample is located in the leftmost column. (d) The 
elements inside that column are the time steps corresponding to that 
feature. 
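
In NumPy terms, the three axes of this block come in the order (samples, time steps, features). The sketch below builds a made-up block with the same dimensions as the figure and pulls out the first sample and its first feature. The variable names are ours.

```python
import numpy as np

# A made-up block: 3 samples, 7 time steps, 2 features.
X = np.arange(3 * 7 * 2, dtype=float).reshape(3, 7, 2)

first_sample = X[0]                  # a 2D grid: 7 time steps by 2 features
first_feature = first_sample[:, 0]   # the leftmost column: 7 time steps
print(X.shape, first_sample.shape, first_feature.shape)
```

Printing shapes like this is a cheap way to confirm that data is arranged the way the network expects before we start training.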


As we mentioned, we set up our pre-processing so that we now have 
our data in the proper shape, as shown in Figure 24.54. 

Figure 24.54: The shape of our training data, assuming we have 95 
samples and a window size of 7. The test data has a similar shape, though 
fewer samples. 


This wraps up the pre-processing of our data so it’s ready for training. 


24.5.3 Building and Training an RNN 


Now that our data is properly normalized and structured, we can cre- 
ate the network and train it. 


We'll create an extremely simple RNN that runs quickly, yet still 
demonstrates all the basic principles. We'll have one recurrent layer, 
followed by one dense layer. Figure 24.55 shows the architecture. Note 
that the dense layer has no activation function listed. That’s because 
we don’t want it to modify the value it computes, since that value is our 
prediction. We can either say that we've left off the activation function, 
or we've set it to the linear function. 

Figure 24.55: Our first RNN will have 1 recurrent layer with 3 elements 
of memory, followed by a fully-connected layer with one unit. The input 
contains as many time steps as are in our window. The dense layer has no 
activation function. 


Recall from Chapter 22 that there are two standard types of building 
blocks used in RNNs: the LSTM (Long Short-Term Memory), and GRU 
(Gated Recurrent Unit). Keras supports both of these, and the layers 
are named for the unit they use. 


Let’s use an LSTM. An RNN icon, as in Figure 24.55, represents an 
LSTM unit unless explicitly marked otherwise. 


To create an LSTM layer, we just create an LSTM object with the options 
we want, and then put it into our model with add() as usual. 


How much memory should be in the state for this layer? Let’s arbi- 
trarily start with just 3 elements. 


When we make an LSTM layer, we specify the number of cells we want. 
Because this will be our first layer of the network, we also have to sup- 
ply the input dimensions. As usual, we set the argument input_shape 
to the shape of one sample. From our discussion above, we know that 
each sample is a 2D grid whose height is the number of time steps 
(that’s the height of our window), and whose width is the number of 
features (we have just 1), as we saw in Figure 24.53. 


Listing 24.57 shows how to create this layer. 


lstm_layer = LSTM(3, input_shape=[window_size, 1]) 


Listing 24.57: How to create an LSTM layer object. The first argument 
is the number of LSTM cells in the layer. The second argument, 
input_shape, tells the size of a single sample. 

We follow this up with a dense layer of a single neuron. If we don’t 
specify an activation function in a Dense layer, Keras defaults to None. 
This is good for us in this situation, because we don’t want the output 
to be squashed down to the range [0,1] or [-1,1] or any other range. 
Just to be explicit, we'll include a redundant assignment of None to the 
activation function. 


The complete model is shown in Listing 24.58. 


# create and fit the LSTM network 

model = Sequential() 

model.add(LSTM(3, input_shape=[window_size, 1])) 
model.add(Dense(1, activation=None) ) 


Listing 24.58: How to create our RNN model. We just create the model 
as a Sequential object as usual, add in our RNN layer (in this case, an 
LSTM object), and then we place a one-neuron Dense layer at the end. 
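
To get a feeling for just how small this network is, we can count its weights by hand. The sketch below assumes the standard LSTM formulation with bias terms, where each of four internal weight sets sees both the input features and the layer's previous output. The function names are ours. Calling model.summary() on the Keras model should report matching totals.

```python
def lstm_parameter_count(units, features):
    '''Weights in a standard LSTM layer with biases: four sets of
    weights (three gates plus the candidate state), each seeing the
    input features and the previous output.'''
    return 4 * (units * (features + units) + units)

def dense_parameter_count(inputs, units):
    '''One weight per input plus one bias, for each unit.'''
    return units * (inputs + 1)

lstm_params = lstm_parameter_count(units=3, features=1)
dense_params = dense_parameter_count(inputs=3, units=1)
print(lstm_params, dense_params, lstm_params + dense_params)
```

That works out to 60 weights in the LSTM layer and 4 in the dense layer, or 64 in all.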


That’s it for building our model. Now we just compile it and run. 


As usual, to compile the model we need to supply a loss function and 
an optimizer. We’ve been using the Adam optimizer in this chapter 
and it’s been working great, so let’s keep using it. For the loss func- 
tion, we don’t want to use the same categorization function we used 
earlier, because we’re no longer doing categorization. What we want 
is something that will compare the single value that comes out of 
our network with the target value for that sample. Consulting the 
list of loss functions in the Keras documentation, we can see that the 
mean_squared_error loss function does the job, so let’s use that. 


The compilation step is shown in Listing 24.59. 


model.compile(loss='mean_squared_error', optimizer='adam') 


Listing 24.59: How to compile our RNN. We use the 'adam' optimizer 
as before. We've chosen to use the 'mean_squared_error' loss function, 
which is appropriate for this RNN. 
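
As a reminder of what this loss measures, here is a plain-Python sketch of the calculation on a few made-up numbers. The function name is ours, and Keras computes its version on batches of tensors rather than Python lists, so treat this as an illustration of the formula only.

```python
def mean_squared_error_loss(targets, predictions):
    '''Average of the squared differences between matching values.'''
    squared = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return sum(squared) / len(squared)

targets = [0.5, 1.0, 0.0]        # made-up scaled target values
predictions = [0.5, 0.5, 0.5]    # made-up network outputs
loss = mean_squared_error_loss(targets, predictions)
print(loss)
```

The first prediction is perfect and contributes nothing, while the other two are each off by 0.5 and contribute 0.25 apiece, for an average of about 0.167.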

Now that our model is compiled, we train it just like every other model 
by calling fit(). The only change we'll make here is to use a batch size 
of 1, because some experimentation showed that for this tiny network 
and this small dataset, that produced the best results. Listing 24.60 
shows our command to train the network. 


history = model.fit(X_train, y_train, epochs=number_of_epochs, 
batch_size=1, verbose=2) 


Listing 24.60: Training our RNN with fit(). This is like all of our other 
training steps, except we've set batch_size to 1. 


Bringing these steps together gives us the code in Listing 24.61. This 
builds our RNN, and then trains it for whatever number of epochs we 
choose to save in the variable number_of_epochs. 


# create and fit the LSTM network 

model = Sequential() 

model.add(LSTM(3, input_shape=[window_size, 1])) 

model.add(Dense(1) ) 

model.compile(loss='mean_squared_error', optimizer='adam' ) 

history = model.fit(X_train, y_train, epochs=number_of_epochs, 
batch_size=1, verbose=2) 


Listing 24.61: Building and training our RNN. 


Now that we have our model trained, we can ask it for its predictions. 
In Listing 24.62 we compute predictions for both the training and test 
data. 


y_train_predict = model.predict(X_train) 
y_test_predict = model.predict(X_test) 


Listing 24.62: We get predictions from our model by handing the sample 
data to the model’s predict() method, as usual. 


Before we look at the results, we should mention that when we use the 
network to make predictions, we rarely use the results directly. 

The issue is that we’ve applied the MinMaxScaler to our input data 
(both training and test), transforming it from our original values to a 
range more suitable for training. That means that the network’s 
predictions are also transformed. For example, if our original data was in 
the range [-6,6], then after the transformation it will be in the range 
[0,1]. This means that the predictions will be in the range [0,1] as well. 


Because of this, we cannot directly compare our predictions with our 
original data. In the case of a simple scaling, like we’re doing here, 
that’s not a major issue. But if we perform a more complicated process- 
ing step, then it could be very hard to mentally interpret the predicted 
data. 


As we saw in Chapter 12, the general solution is to inverse-transform 
the predicted data. Like most of scikit-learn’s transformation 
routines, MinMaxScaler comes with a method called inverse_transform() 
that does just this. 


In Listing 24.63 we use the inverse_transform() routine of 
min_max_scaler (our MinMaxScaler object) to un-transform our predictions. 
We'll also invert the transform on our training and test targets. 


# inverse-transform original targets 

inverse_y_train = min_max_scaler.inverse_transform([y_train] ) 

inverse_y_test = min_max_scaler.inverse_transform([y_test] ) 

# inverse-transform predictions 

inverse_y_train_predict = \ 
min_max_scaler.inverse_transform(y_train_predict) 

inverse_y_test_predict = \ 
min_max_scaler.inverse_transform(y_test_predict) 


Listing 24.63: We invert both the original target data and the predicted 
targets for both the training and test sets. This undoes the scaling operation 
we performed using the transform() method of min_max_scaler. 


Inverting (that is, un-transforming) the original targets y_train and 
y_test may seem wasteful. Why not simply save the original targets in 
their un-transformed form, and use them here? 

Let’s look again at the last two lines of Listing 24.54, repeated here as 
Listing 24.64. 


scaled_train_sequence = min_max_scaler.transform(train_sequence) 
scaled_test_sequence = min_max_scaler.transform(test_sequence) 


Listing 24.64: A repeat of the last two lines of Listing 24.54, where we 
applied transformations to our data. 


We can see that we transformed each entire sequence before we chopped 
it into windows and split each window into a sample and a label, so we 
never really had the labels y_train and y_test sitting around in 
non-transformed variables before. 


We structured the code this way for clarity and simplicity. It’s certainly 
reasonable to extract and save the labels before they’re transformed. 
Then we wouldn’t need to undo the transformation here. Either way 
works. We'll stick with the version we just presented. 


Now that we have the predictions back in the original range of the data, 
we can plot them with the original data and see how good our predictions 
are. We can also use them to get a quick numerical summary of 
accuracy using a measure called the root mean squared error, or RMS 
error. This is a standard way to measure error that lets us compare 
apples to apples when we look at multiple networks. It’s close to what 
the 'mean_squared_error' loss function is computing, except that we 
include a square root. To compute this, we use the square-root routine 
sqrt() from the math module, and the mean_squared_error() routine 
from scikit-learn. Listing 24.65 shows the steps. 

import math
from sklearn.metrics import mean_squared_error

trainScore = math.sqrt(mean_squared_error(
    inverse_y_train[0],
    inverse_y_train_predict[:,0]))
print('Training RMS error: {:.2f}'.format(trainScore))

testScore = math.sqrt(mean_squared_error(
    inverse_y_test[0],
    inverse_y_test_predict[:,0]))
print('Test RMS error: {:.2f}'.format(testScore))


Listing 24.65: Computing and reporting the root-mean-squared (RMS) 
error for our training and test predictions. 


The funny indexing comes from the different shapes of the origi- 
nal targets and their predictions. In our code, inverse_y_train and 
inverse_y_test are 2D grids with one row (for example, if there 
are 30 targets in each set, their shapes would be 1 by 30), so we get a 
list containing the data stored in the first (and only) row by selecting 
inverse_y_train[0]. On the other hand, the predictions coming back 
from model.predict() are 2D grids that have one column (so they 
would be 30 by 1). We get a list of the data in the first (and only) col- 
umn by selecting inverse_y_train_predict[:,0]. 
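
Here is a small sketch of those two shapes, using a made-up list of 30 targets. The variable names are ours. Wrapping a 1D list in square brackets gives a grid with one row, while reshaping it into a column mimics the shape that comes back from model.predict().

```python
import numpy as np

targets = np.arange(30.0)                 # made-up 1D list of 30 targets

row = np.array([targets])                 # wrapping in [ ] gives 1 row
column = targets.reshape(-1, 1)           # shaped like model.predict() output
print(row.shape, column.shape)            # (1, 30) and (30, 1)

row_as_list = row[0]                      # first (and only) row
column_as_list = column[:, 0]             # first (and only) column
print(row_as_list.shape, column_as_list.shape)
```

After the indexing, both results are 1D lists of 30 values, so they can be compared element by element.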


Issues like getting these indices right are frequently hard to anticipate, 
so we often discover them only when we write some code that seems 
reasonable but then messes up. Using interpreted Python interactively 
lets us examine the shapes of our variables step by step and line by 
line, and develop the proper adjustments and selections to select the 
data we want at each step. 
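
Returning to the RMS error for a moment, it's worth seeing why we bother to invert the scaling before measuring it. The plain-Python sketch below uses made-up numbers and our own helper names. Because a min-max scaling is just a stretch and a shift, the RMS error in the original units comes out as the scaled RMS error multiplied by the width of the original range.

```python
import math

def rms_error(targets, predictions):
    '''Square root of the mean of the squared differences.'''
    squared = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return math.sqrt(sum(squared) / len(squared))

lo, hi = -6.0, 6.0                      # made-up original data range
scaled_targets = [0.25, 0.50, 0.75]     # made-up values in [0,1]
scaled_predictions = [0.30, 0.45, 0.80]

def unscale(values):
    '''Invert a [0,1] min-max scaling back to the range [lo, hi].'''
    return [v * (hi - lo) + lo for v in values]

scaled_rms = rms_error(scaled_targets, scaled_predictions)
original_rms = rms_error(unscale(scaled_targets),
                         unscale(scaled_predictions))
print(scaled_rms, original_rms)
```

Here every prediction is off by 0.05 in the scaled data, so the scaled RMS error is 0.05. In the original range, which is 12 units wide, that becomes 0.6.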


24.5.4 Analyzing RNN Performance 


Let’s try out our sine wave data on this tiny RNN. We'll start with the 
easier, first data set in Figure 24.47, shown here again as Figure 24.56. 

Figure 24.56: The “calm” data set. This figure is a repeat of Figure 24.47. 
There are 200 data points in this set. 


Recall that our data-generating routine with the ungainly name 
samples_and_targets_from_sequence() takes an argument called 
window_size that lets us specify how many time steps are to be used 
in each sample. 


Let’s arbitrarily start with a window size of 3 steps. This means each 
sample will have 3 values, and we'll ask the network to predict the 
one that comes after. We'll always provide that value as the target, so 
during training the system can learn to match that value, and during 
testing we can see how well we did. 


As usual, we'll train for 100 epochs. After each epoch, we get back a sin- 
gle number that tells us our loss, measured by the difference between 
the value we predict and the target. 


The loss is plotted in Figure 24.57. 

Figure 24.57: The loss from training 100 epochs of our RNN on data set 
0 with a window of 3 time steps. 


The loss drops quickly to about 0.06, then more gently to something 
close to zero around epoch 8, and then over the next 60 epochs or so 
it continues to drop, finally hitting a value visually hard to distinguish 
from zero at around epoch 80. 


Let’s draw our predictions on top of our data so we can eyeball our 
performance. 


Figure 24.58 shows our training data in black, and the predictions in 
red. Recall that we’re using 200 samples in our training set. 

Figure 24.58: Our training data for set 0 is shown in black. Overlaid on 
that are the predictions from our RNN after 100 epochs of training with 
a window of 3 time steps. 


This is pretty spectacular for a network with only 1 LSTM unit with 3 
elements of state and 1 neuron after that. The match between the pre- 
dictions and the real values isn’t perfect, as we can see near the tops of 
the hills and bottoms of the valleys, but it’s pretty great. 


The predictions start slightly after the start of the training data, because 
the first prediction is the 4th element of the training data. This is hard 
to see in this figure, but will be easier to spot in later results (such as 
Figure 24.70). 
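
One way to line up predictions with the data for plotting is sketched below. The helper name align_predictions and the numbers are made up for illustration. It places each prediction at the index of the element it predicts, and fills the positions that have no prediction with NaN. Many plotting libraries, Matplotlib among them, leave a gap where they find NaN.

```python
import numpy as np

def align_predictions(predictions, sequence_length, window_size):
    '''Place each prediction at the index of the element it predicts,
    leaving NaN where no prediction exists (hypothetical helper).'''
    aligned = np.full(sequence_length, np.nan)
    aligned[window_size:window_size + len(predictions)] = predictions
    return aligned

predictions = np.array([0.3, 0.4, 0.5])   # made-up predictions
aligned = align_predictions(predictions, sequence_length=6, window_size=3)
print(aligned)
```

With a window of 3 time steps, the first 3 positions stay empty and the first prediction sits at the 4th element.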


Let’s now look at the test data. Figure 24.59 shows our test data in 
black, and the predicted data in red. Recall that we’re also using 200 
samples in our test set. 




[Figure 24.59 plot, titled "test set 0, window 3": the test data and the test predictions. Horizontal axis: index, from 0 to 200. Vertical axis: test and prediction.]


Figure 24.59: Our RNN’s predictions for the test data of set 0, after 100 
epochs of training with a window of 3 time steps. 


The test data looks similar to the training data because it’s made from 
the same repeating sine waves, just located later in the sequence. The 
test predictions are close, again messing up near the extremes of the 
hills and valleys. 


Maybe we don’t even need 3 time steps in our window. What if we try a 
window of just 1 time step? So we give the network a single value, and 
ask it to predict the next one. This only has even a hope of working 
because our dataset probably doesn’t have any numbers that repeat 
exactly. So if it can learn the value that comes after each value in the 
training set, it should be able to reproduce those numbers. The loss is 
plotted in Figure 24.60. 

Figure 24.60: The loss from training 100 epochs of our RNN on data set 
0 with a window of 1 time step. 


Figure 24.61 shows its predictions on our test data. 


Figure 24.61: Our RNN’s predictions for the test data from set 0, after 
100 epochs of training with a window of 1 time step. 


We've clearly lost considerable performance, but the network is doing 
a great job at memorizing input/output pairs. 
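
The network doesn't literally build a table, but a lookup table is the extreme version of memorizing input/output pairs, so it makes a useful point of comparison. The plain-Python sketch below uses a made-up sine wave and our own variable names.

```python
import math

# A made-up sine wave sampled at 50 steps.
sequence = [math.sin(0.1 * step) for step in range(50)]

# "Train" by remembering which value followed each value.
next_value = {}
for current, following in zip(sequence[:-1], sequence[1:]):
    next_value[current] = following

# A value seen during "training" gives back the value that followed it...
print(next_value[sequence[10]] == sequence[11])
# ...but a value that was never seen has no entry at all.
print(0.123 in next_value)
```

A pure table like this has nothing to say about values it hasn't seen. Also, with only one value to go on, nothing tells us whether the curve is currently rising or falling, which is one reason a window of a single time step is at a disadvantage.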


Even though 3 steps did a good job predicting our test data, let’s go 
the other way and crank our window size up to 5 time steps. Since 
we're just trying to get a feeling for things now, rather than carry out a 
detailed analysis, we'll skip the curve showing the training predictions, 
and go straight to the loss curve and test predictions. Figure 24.62 
shows the loss for a window of 5 time steps. 

Figure 24.62: The test set loss from training 100 epochs of our RNN on 
data set 0 with a window of 5 time steps. 


Figure 24.63 shows our test predictions for a window of 5 steps. 

Figure 24.63: Our RNN’s predictions for the test data from set 0, after 
100 epochs of training with a window of 5 time steps. 


Visually, this looks a lot like our window of 3 steps in Figure 24.59. 
Perhaps a window of 3 pieces of data was enough for the network to do 
a really good job of predicting the 4th. 


Let’s try the more complicated data in our second test set, shown in 
Figure 24.48, repeated here in Figure 24.64. 

Figure 24.64: Data set 1. This figure is a repeat of Figure 24.48. 


Since a window of 3 worked well for the simple data, let’s try that again. 
The test set loss is plotted in Figure 24.65. 


Figure 24.65: The test set loss from training 100 epochs of our RNN on 
data set 2 with a window of 3 time steps. 


As with the first data set, the loss plunges at the start and then slows 
its descent. There’s a knee around epoch 4 and a more gradual one 
around epoch 50, until the loss reaches nearly zero around epoch 80. This 
suggests that this test set is harder for our tiny network to learn than 
the last one, which makes sense. 


Let’s look at how well this window of 3 matches the test data. Figure 
24.66 shows our results. 

Figure 24.66: Our RNN’s predictions for the test data of set 2 after 100 
epochs of training with a window size of 3 time steps. 


This is a pretty great match for such a tiny network and such a wiggly 
set of data, particularly with such small windows. Let’s try a window of 
5 elements, shown in Figure 24.67. 

Figure 24.67: The predictions for the complex dataset with a window of 
5 steps. 


Wow. Increasing the window from 3 to 5 seems to have mostly backfired, 
though in some places, like the peak around index 140, the 
match improved. 


Looking back over these results, our tiny network of just 1 LSTM unit with 
3 elements of memory, and 1 final neuron, did a great job on both test sets. 
The window size usually made a difference. For these simple datasets, 
a window of 3 or 5 seemed to usually do a very good job. As we'll see 
later, more complicated datasets will often need larger windows. 


24.5.5 A More Complex Dataset 


We used only one recurrent layer so far because it worked so well. But 
we can build deep recurrent networks by simply adding in more 
recurrent layers. Depending on the data, it may be best to have just 
a few recurrent layers, each with lots of units of state memory, or it 
might be better to have many layers, each with just a small amount of 
memory. 




We can make a small change to our sine-wave data to make it much 
more challenging for an RNN to learn: any time the curve is heading 
downwards, we'll flip it around the X axis so that it’s heading upwards. 
Our curve will change from being smooth to choppy, with abrupt 
jumps. This is a completely arbitrary operation that we’re applying to 
create a more challenging dataset for our networks. 


Figure 24.68 shows this operation applied to the training data for our 
second set of waves, creating a third set of data, which we'll call data 
set 2. 

Figure 24.68: Creating a third set of data by starting with our second set 
of training data, only any time the curve starts to head downwards, we 
reflect it around the X axis. This is the data we'll use to train our deep 
RNNs. 


Naturally enough, we call the modified data-making routine 
sum_of_upsloping_sines(), and present it in Listing 24.66. 

import math
import numpy as np

def sum_of_upsloping_sines(number_of_steps, d_theta,
                           skip_steps, freqs, amps, phases):
    '''Like sum_of_sines(), but always sloping upwards'''
    values = []
    for step_num in range(number_of_steps):
        angle = d_theta * (step_num + skip_steps)
        sum = 0
        for wave in range(len(freqs)):
            y = amps[wave] * math.sin(
                    freqs[wave]*(phases[wave] + angle))
            sum += y
        values.append(sum)
        if step_num > 0:   # are we past the first sample?
            # find the direction we're headed in
            sum_change = sum - prev_sum
            if sum_change < 0:     # are we going downward?
                values[-1] *= -1   # if so, flip the last value
        prev_sum = sum             # remember the un-flipped sum
    # return as a Numpy array (assumed to match sum_of_sines())
    return np.array(values)


Listing 24.66: A modification of sum_of_sines() where we flip the 
curve upside down any time it’s heading downwards. The new lines are in 
the if statement and the assignment just after it. 


Let’s run our new dataset through the same tiny network as in our 
most recent experiment, keeping the window size of 5. The loss results 
are shown in Figure 24.69. 

Figure 24.69: The test loss from running our upsloping test data through 
our 3-unit RNN with a window of 5 time steps. 


The loss drops fast at the very start, and then improvement slows down 
a lot. Around epoch 60 it seems to have settled down, perhaps 
improving just a tiny bit from then on. 


The predictions by this model to our modified test data are shown in 
Figure 24.70. 


[Figure 24.70 plot, titled "test set 2, window 5". Horizontal axis: index, from 0 to 200. Vertical axis: test and prediction.]


Figure 24.70: The quality of our model's predictions for the modified 
test data. 


The late start (where the first 5 values have no prediction, because 
the window holds 5 time steps) is easier to see here at the far left. 
This result is pretty bad. The predicted values 
do tend to generally track the straighter sections of the test data, but 
the predictions frequently overshoot and undershoot the peaks and 
valleys. 


Our tiny network worked surprisingly well, but we’ve finally asked too 
much of it. 


24.5.6 Deep RNNs 


Let’s add a second recurrent layer, also made of 3 LSTM units. 


Figure 24.71 shows our new architecture. 





Figure 24.71: The diagram for our 2-layer RNN. The small box on the 
output end of the first LSTM indicates that it has return_sequences 
set to True. 


Listing 24.67 shows how to build our 2-layer deep RNN model, using 2 
layers of 3 LSTM units each. 


The first LSTM needs us to include a new argument, return_sequences, 
which we need to set to True. We indicate this with a small box on the 
right of the LSTM icon (or, when the network is drawn bottom to top, 
on the top). We'll discuss what this return_sequences is about soon, 
but for now we can treat it as something that has to be included any 
time we create an LSTM that is followed by another LSTM. 


Here’s the code, with return_sequences set to True in the first LSTM. 


model = Sequential()
model.add(LSTM(3, return_sequences=True,
               input_shape=[window_size, 1]))
model.add(LSTM(3))
model.add(Dense(1))


Listing 24.67: Building a deep RNN just means adding more recurrent 
layers. All recurrent layers that precede another must have their optional 
argument return_sequences set to True. 
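
Without getting ahead of the full discussion, a toy example hints at why this flag matters. The sketch below is a bare-bones recurrent layer written in NumPy with random weights. It is not an LSTM and it is not Keras code, and the function name is ours. It only imitates one behavior: returning either the output from the final time step, or the outputs from every time step.

```python
import numpy as np

def toy_recurrent_layer(sample, units, return_sequences=False, seed=0):
    '''A bare-bones recurrent layer (NOT an LSTM, and not Keras code).
    sample has shape (time_steps, features).'''
    rng = np.random.default_rng(seed)
    features = sample.shape[1]
    W = rng.normal(size=(features, units))   # input weights
    U = rng.normal(size=(units, units))      # state weights
    state = np.zeros(units)
    outputs = []
    for x in sample:                         # one time step at a time
        state = np.tanh(x @ W + state @ U)
        outputs.append(state)
    if return_sequences:
        return np.array(outputs)             # (time_steps, units)
    return outputs[-1]                       # (units,)

sample = np.linspace(0.0, 1.0, 5).reshape(5, 1)   # 5 time steps, 1 feature
last_only = toy_recurrent_layer(sample, 3)
full = toy_recurrent_layer(sample, 3, return_sequences=True)
stacked = toy_recurrent_layer(full, 3)       # a second layer needs 2D input
print(last_only.shape, full.shape, stacked.shape)
```

A recurrent layer expects a grid of time steps by features for each sample. The full sequence of outputs has that form, so it can feed a second recurrent layer. The single final output is just a 1D list, which has no time steps left to step through.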


As far as Keras is concerned, this is just another Sequential model, so 
we can train this model and get predictions from it just as before. 

Let’s see how this performs with our new data. 


Figure 24.72 shows the loss during training. 


Figure 24.72: Training our model with 2 recurrent layers of 3 LSTM units 
each produces these loss results. 


The loss gets down to a little more than 0.02, which is roughly what we 
saw before. So we shouldn’t get too optimistic about the predictions. 


Figure 24.73 shows the predictions on the test data. 

Figure 24.73: Training our model with 2 recurrent layers of 3 LSTM units 
each produces these predictions to the test data. 


This is pretty bad. Around indices 130 to 180 it seems to lose track of 
the value of the data, though it does roughly mimic its rising and 
falling. It seems that adding a second layer has made things a lot worse. 


Let’s see if we can get something better with an even deeper model. 
Let’s make 3 LSTM layers of decreasing sizes, with 9, 6, and 3 units 
respectively. Figure 24.74 shows this architecture. 


Figure 24.74: The diagram for our 3-layer RNN. 


Listing 24.68 shows the code to build the model. 

model = Sequential()
model.add(LSTM(9, return_sequences=True,
               input_shape=[window_size, 1]))
model.add(LSTM(6, return_sequences=True))
model.add(LSTM(3))
model.add(Dense(1))


Listing 24.68: Making an even deeper RNN with 3 recurrent layers of 
decreasing numbers of LSTM units. 


Once again, each LSTM that feeds another LSTM has to have 
return_sequences set to True, indicated by the small box at the top of 
the icon. 


Figure 24.75 shows the loss of our 3-layer model during training. 

Figure 24.75: The loss from training our model with 3 recurrent layers of 
9, 6, and 3 LSTM units. 


This is encouraging, because while the loss at epoch 100 is still only 
about 0.025, like in previous runs, the loss is still dropping, while the 
previous loss graphs were flat. If we kept learning, we ought to expect 
further improvements. It’s not a dramatic improvement, though. 


Figure 24.76 shows the prediction results for this deeper RNN. 

Figure 24.76: The test data predictions made by our deep RNN with 9, 6, 
and 3 LSTM units on successive layers. 


The improved numerical loss is encouraging, but the network is still 
not doing a good job visually. This might even be worse than before. 


24.5.7 The Value of More Data 


Remember our general principle that more data is usually better than 
fancier algorithms. So rather than continue to tweak the network, let’s 
get more data. 


One of the pleasures of working with synthetic data is that we can 
make as much of it as we want. The values of the frequencies in the 
test 2 dataset don’t repeat for a long time, so we can crank out a lot of 
data (over 40,000 samples) without repeating. Let’s increase the size 
of our training set from 200 samples to 2000. To match the 10-fold 
increase in the number of training samples, let’s increase the window 
from 5 to 13, leaving all the other parameters the same. 
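

As a reminder of what the window size controls, here is a minimal sketch of chopping a 1D series into windows and next-value targets. The helper name make_windows() and the sine-wave stand-in data are our own assumptions for illustration; they are not the notebook's data generator.

```python
import math

def make_windows(series, window_size):
    """Split a 1D series into (window, next value) training pairs."""
    samples, targets = [], []
    for start in range(len(series) - window_size):
        samples.append(series[start:start + window_size])
        targets.append(series[start + window_size])
    return samples, targets

# Stand-in data: 2000 points of a sine wave (not the book's dataset).
series = [math.sin(0.1 * t) for t in range(2000)]

for window_size in (5, 13):
    samples, targets = make_windows(series, window_size)
    # number of training pairs, and the length of each window
    print(window_size, len(samples), len(samples[0]))
```

A longer window gives the network more context for each prediction, at the cost of slightly fewer training pairs from the same series.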


The loss during training is shown in Figure 24.77. 


Figure 24.77: Loss from training our 3-layer network with 2000 samples
and a window size of 13 time steps.


This is a huge drop in loss. In Figure 24.75 the loss got down to about
0.025 after 100 epochs. Here, the loss seems to be about 0.004.


If we plot all 2000 samples they'll jam up and give us a black
rectangle, so let’s look at just the first 200 samples. This has the added
benefit that we’re already familiar with them. Figure 24.78 shows the
predictions on our test data.


Figure 24.78: Test data predictions by our 3-layer deep RNN after training
on 2000 samples chunked into windows of 13 time steps.


This is a big improvement. There’s still some over- and under-shoot- 
ing going on, but generally we have a much better match than before. 
More data really does help! 


Thanks to our procedural data generator, we can crank up the size of our
training set by another factor of 10 to see what happens. We'll leave
everything else the same, but increase the training set from 2,000 to
20,000 samples.


The loss results are in Figure 24.79. 


Figure 24.79: Loss from training our 3-layer network with 20,000 samples
and a window size of 13 time steps.


The loss has dropped to about 0.0015, which is less than half of the 
roughly 0.004 we had before. 


Figure 24.80 shows the predictions on our test data. 


Figure 24.80: Test data predictions by our 3-layer deep RNN after training
on 20,000 samples chunked into windows of 13 time steps.


This is the best set of predictions of this difficult dataset that we’ve
seen yet. There are still a few obvious misses, but compared to
previous results this is pretty close.


Even more data helps even more! 


It’s worth noting that the time taken to learn these increasingly large 
training sets is roughly proportional to the number of samples. Each 
epoch of the 20,000 sample training set took a little over 5 seconds 
using only the CPU on a late 2014 iMac (that is, there was no GPU 
acceleration). So Figure 24.80 took a bit over 28 hours to compute. 
Every one of these epochs took about 10 times longer than each epoch
of the 2,000 sample set, which in turn took about 10 times longer than
each epoch of the 200 sample set. More data is great, but
processing it comes at a price.


The good news is that we pay that price only once, during training. 
We've been steadily improving our 3-layer RNN without changing the 
model, so all of these models take the same amount of time to predict 
new values after training is over. Our up-front training cost is amor- 
tized over every use of our model, forever. 


As we’ve seen, RNNs are sensitive to how they’re trained. Deep RNNs
are even more sensitive. Because we didn’t tune these architectures
at all as we added new layers, we’re probably leaving a lot of
performance on the table. By adjusting the window size, the learning rate,
the parameters to our Adam optimizer, and our choice of loss function,
we might be able to improve our best results in Figure 24.80. In
practice, it is usually worth exploring the effects of different modeling and
training parameters to see what works best for a given network and
data.


24.5.8 Returning Sequences


This section’s notebook is 
Keras-Notebook-14-RNN-Sequence-Shapes.ipynb 


In our deep networks above, we used the output of one RNN as the 
input of another RNN. We saw that the earlier RNN layer (the one 
providing the input to the next) needed a new argument. Its name was 
return_sequences and we set it to True (the default is False). Now 
it’s time to make good on our promise to discuss what that argument 
is about. 


Let’s return to our first, simple network of an RNN followed by a
dense layer, as we saw back in Figure 24.55. Our goal was to hand the
network a sequence of time steps, and then have it predict the next
value after the sequence.


Our samples contained just one feature, which held a series of values 
from a 1D curve. These made up the time steps. 


When we gave the RNN our sample, it read the first time step and pro- 
duced an output. This output was the contents of the internal state, 
after passing through the RNN’s internal selection gate. The output 
could be thought of as the RNN’s prediction of the next value of the 
curve. 


But we didn’t care about that prediction, because we already knew the 
second time step. Keras knew we had more time steps to come, so it 
automatically ignored that output, and didn’t even send it to the dense 
layer. Instead, it gave the RNN the second time step. Again, the RNN 
produced an output, and again, Keras ignored it. Figure 24.81 shows 
the idea, for an RNN with 4 elements of internal data. Here we’ve 
handed the RNN the third time step, and it’s produced the third out- 
put, which we’re ignoring. 


Figure 24.81: Using an RNN with 4 elements of state. The input sample
has a single feature, containing a list of time steps. After each time step is
evaluated, the RNN produces an output 4 elements long, as shown here
for the 3rd time step. We've been ignoring all outputs except the last one.


We did this over and over, handing the RNN sequential time steps and 
ignoring the outputs, until we gave it the last time step in the sample. 
The output of that time step was the prediction for the value of the 
sequence after the end of our inputs, so that output was the value we 
were after all along. We fed that to our dense layer, and the output was 
the prediction. 


Suppose our inputs had more than one feature. If our data held 
weather measurements at the top of a mountain, maybe each sample 
held temperature, wind speed, and humidity. Let’s say what we want 
from our RNN is a prediction of how good the radio reception would 
have been on the mountain at that time. So at each time step we give 
the RNN the values for all three features at that time step. The output 
is again the RNN’s internal state after the selection gate, so it has as 
many elements as there are elements in the internal state. As before, 
we get back one such output for every time step input we provide, and 
we only pay attention to the last one. Figure 24.82 shows the idea. 


Figure 24.82: When we have multiple features in our sample, we provide
all the features for a given time step to the RNN.


There are a couple of things worth noting in Figure 24.82. 


First, our input is a 3D tensor. In this example, it has 1 sample, 7 time
steps, and 3 features, so it has dimensions 1 by 7 by 3. The number
of time steps, and the number of features, don’t appear in the output,
which is a 2D tensor of shape 1 by 4. The 1 is because we only care
about one output (the last one), and the 4 comes from the internal
state of the RNN, which we're assuming has 4 elements.


We “lost” the number of features because they are used internally by 
the RNN to control the forgetting, remembering, and selecting of the 
internal state. We “lost” the number of time steps because we chose to 
ignore all but the last one. 


Our input could have had 19 features and 37 time steps, and the out- 
put would still be 1 by 4. 
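

We can check this shape bookkeeping with a small NumPy sketch. This is a plain tanh recurrent cell with random weights, not Keras's LSTM, and last_output() is a name we made up for illustration. Only the shapes matter here.

```python
import numpy as np

def last_output(x, num_state, seed=0):
    """Run a plain tanh recurrent cell over x and return only the final output.

    x has shape (samples, time_steps, features).
    """
    rng = np.random.default_rng(seed)
    samples, time_steps, features = x.shape
    w_in = rng.normal(size=(features, num_state))      # input weights
    w_state = rng.normal(size=(num_state, num_state))  # recurrent weights
    state = np.zeros((samples, num_state))
    for t in range(time_steps):
        # all the features for time step t are fed in together
        state = np.tanh(x[:, t, :] @ w_in + state @ w_state)
    return state  # outputs from earlier time steps are discarded

print(last_output(np.ones((1, 7, 3)), num_state=4).shape)    # (1, 4)
print(last_output(np.ones((1, 37, 19)), num_state=4).shape)  # (1, 4)
```

The features are absorbed by the input weights and the time steps are absorbed by the loop, so neither appears in the output shape.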


Let’s expand the picture a little to include the unrolled RNN diagram. 
In Figure 24.83 we have a sample with 5 time steps and 3 features. 
Once again, the RNN’s internal state has 4 elements. We can see in the 
figure that at each time step, an entire row of features is fed to the RNN, 
which produces an output. Then the RNN’s state changes, which sets 
up the RNN for the next input, as shown by the open downward-point- 
ing arrow. We only pay attention to the last output. The output is a 2D 
grid of shape 1 by 4. 


Figure 24.83: Passing a sample to an RNN. The unrolled RNN diagram is
shown vertically. As before, the outputs are ignored except for the final
time step.


If we want to feed this output to a dense layer, we don’t have to do a 
thing. 


But suppose we want to take our sequence of outputs and present them 
as a sequence of inputs to another RNN layer, as we did in some of our 
deep RNN models above. We know that an RNN needs a 3D input, and 
the output here is 2D. 


We could just give it a depth of 1, producing a shape that’s 1 by 1 by 4. 
While this is now legal for an RNN, it doesn’t make any sense. A ten- 
sor with this shape would be interpreted as a single sample (the first 1) 
with 1 time step (the second 1), containing 4 features (the 4 at the end). 
That’s nothing like our single sample of 5 time steps and 3 features. 


Losing the time step information is a big problem, because that’s the 
idea at the heart of an RNN. We’re giving our first layer 5 time steps, 
and it’s producing 5 outputs. We then want to hand those 5 outputs to 
the next layer. Each output will have 4 elements (since we’re suppos- 
ing that our RNN has 4 elements of internal state), so those 4 values 
will be interpreted by the next RNN as 4 features. But we need the 5 
time steps. 


That’s actually easy to do. We just tell Keras to not ignore the output
after each step. We tell it to take the outputs and stack them up to 
make a grid. It will be as tall as there are time steps, and as wide as 
there are elements in the internal state. Now we can give that grid a 
depth of 1, and it makes sense as an input to an RNN. 


Figure 24.84 shows the idea. 


Figure 24.84: To send an RNN’s output to another RNN, we just remember
the output at every time step. These outputs are stacked together to
form a 2D grid, and then we give it a depth of 1 to mean it’s all one sample,
and we're ready to give this to another RNN.


To tell Keras to remember the output after each time step and build 
up this grid, we tell it that we want the RNN to return not just a sin- 
gle output, but the whole sequence of outputs corresponding to the 
sequence of inputs. 


By setting the optional argument return_sequences to True, we're 
telling Keras to do exactly what Figure 24.84 shows. 


Now that we know what return_sequences is all about, we can usu- 
ally invoke it without even thinking about all of this. If our RNN’s 
output is going into another RNN, just set return_sequences to 
True. If we want only the output after the last time step, we can set 
return_sequences to False, or just leave it off, since that’s the default 
value. 
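

The two settings can be compared side by side in a small NumPy sketch. As a loud caveat, this is a plain tanh recurrent cell with random weights standing in for an LSTM, and simple_rnn() is our own hypothetical helper, not a Keras function.

```python
import numpy as np

def simple_rnn(x, num_state, return_sequences=False, seed=0):
    """A stand-in recurrent layer. x has shape (samples, time_steps, features)."""
    rng = np.random.default_rng(seed)
    samples, time_steps, features = x.shape
    w_in = rng.normal(size=(features, num_state))
    w_state = rng.normal(size=(num_state, num_state))
    state = np.zeros((samples, num_state))
    outputs = []
    for t in range(time_steps):
        state = np.tanh(x[:, t, :] @ w_in + state @ w_state)
        outputs.append(state)
    if return_sequences:
        # stack the per-step outputs into (samples, time_steps, num_state)
        return np.stack(outputs, axis=1)
    return outputs[-1]  # only the final output: (samples, num_state)

x = np.ones((1, 5, 3))                           # 1 sample, 5 time steps, 3 features
first = simple_rnn(x, 4, return_sequences=True)  # feeds another recurrent layer
second = simple_rnn(first, 4)                    # feeds a dense layer
print(first.shape, second.shape)                 # (1, 5, 4) (1, 4)
```

The first layer's 4 state elements become the 4 features seen by the second layer, and the 5 time steps survive because we kept every output.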


A few input shapes and their outputs with return_sequences set to 
both False and True are shown in Figure 24.85. 


(a)
Input    return_sequences=False    return_sequences=True
1x3x1    1x4                       1x3x4
1x3x2    1x4                       1x3x4
2x3x2    2x4                       2x3x4

(b)
Input    return_sequences=False    return_sequences=True
1x5x1    1x4                       1x5x4
2x5x2    2x4                       2x5x4


Figure 24.85: The output of a 4-cell RNN for different shapes of 
input. In each box, the input shape is on the left, the output with 
return_sequences=False is in the middle, and the output with 
return_sequences=True is on the right. 


It’s useful to see at a glance whether an RNN returns just the final out- 
put or the full sequence. We mark the icon for an RNN that returns a 
sequence with a small box on the output side, suggesting multiple out- 
puts, as in Figure 24.86. 


Figure 24.86: Icons for RNN units. Left: When return_sequences is
False, the RNN returns only the final value in the sequence. Right: When
return_sequences is True, the RNN returns its output for each time
step. We mark this with a small box on the output side of the icon.


24.5.9 Stateful RNNs 


We've been focusing on one sample at a time, but in practice we usu- 
ally train in mini-batches. This poses an interesting question for RNNs, 
since their internal memory is always influenced by previous inputs. 
When should we clear that memory and let the RNN start over? 


The usual approach is to clear the internal memory at the start of a 
new batch, or mini-batch. We don’t clear or reset the weights belong- 
ing to the neurons inside the RNN, since those tell it how to do its job. 
We only clear its changing memory that holds the inputs it’s recently 
seen. The thinking is that when a new batch begins we'll possibly be 
getting data that isn’t a continuation of the most recent samples, so we 
don’t want to remember stuff from back then. 


Usually, we shuffle our samples between epochs, so they arrive in an 
unpredictable order each time. But we can keep their order consistent 
from epoch to epoch, if we want to. We do this when we call fit() to 
train our model, setting the optional argument shuffle to False (the 
default is True). 


When the data is always arriving in sequence, there’s no reason to reset 
the memory at the start of each batch, because those samples follow 
the samples in the previous batch. In other words, the batching just 
breaks up the grouping of the samples, and not their sequence. In that
situation, we can tell Keras to not reset the memory at the start of each 
batch. This sometimes can help us train a little faster. 


In Keras, when we take over the responsibility for clearing the mem- 
ory we say that the RNN is in the stateful mode. In this mode, Keras 
only resets the internal state when we tell it to. Usually this is at the 
start (or end) of each epoch. 
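

To see why carrying the state across batches makes sense for in-order data, here is a minimal NumPy sketch. It uses a tiny tanh recurrent cell with hand-picked weights as a stand-in for an LSTM; the weights, the sine-wave input, and the run() helper are all assumptions made for illustration.

```python
import numpy as np

# Hand-picked weights for a cell with 1 input feature and 4 state elements.
w_in = np.array([[0.5, -0.5, 0.25, 1.0]])
w_state = 0.5 * np.eye(4)

def run(steps, state):
    """Feed time steps (shape: steps x 1) through the cell, starting from state."""
    for step in steps:
        state = np.tanh(step @ w_in + state @ w_state)
    return state

sequence = np.sin(np.arange(6.0)).reshape(6, 1)
fresh = np.zeros(4)

whole = run(sequence, fresh)                           # all 6 steps in one go
carried = run(sequence[3:], run(sequence[:3], fresh))  # two batches, state kept
reset = run(sequence[3:], fresh)                       # two batches, state cleared

print(np.allclose(whole, carried))  # carrying the state matches the unbroken run
print(np.allclose(whole, reset))    # clearing it loses what the first half taught
```

When the second batch really is a continuation of the first, keeping the state gives the same result as never having split the data at all.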


Stateful mode can make training go a little faster, but it comes with 
limitations. The batch size must be determined in advance, and it 
becomes a part of the model. The dataset must be a multiple of this 
batch size. For instance, if the batch size is 100, the dataset must be 
100, 200, 300, and so on samples long. If it’s 130 or 271 samples, we'll 
get an error. 


When we later give new data to the model for it to evaluate, that data
also has to come in batches of the same size we used when we trained.
If we want only one prediction, but our batch size is 100, then we can
either pad out our one request with 99 more copies of itself, or just
load up all the unused entries in the batch with 0’s. We'll still end up
waiting for the network to evaluate all those samples, though.
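

Both bookkeeping chores are easy to handle with NumPy. The sketch below shows one possible approach: trimming a dataset down to a whole number of batches, and padding a single request with zeros. The helper names trim_to_batches() and pad_to_batch() are hypothetical, and trimming is just one option (we could also pad the dataset instead of discarding samples).

```python
import numpy as np

def trim_to_batches(data, batch_size):
    """Drop trailing samples so len(data) is a multiple of batch_size."""
    usable = (len(data) // batch_size) * batch_size
    return data[:usable]

def pad_to_batch(sample, batch_size):
    """Build one full batch: the real sample first, zeros in the unused slots."""
    batch = np.zeros((batch_size,) + sample.shape)
    batch[0] = sample
    return batch

data = np.ones((271, 13, 1))             # 271 samples won't work with batches of 100
print(trim_to_batches(data, 100).shape)  # (200, 13, 1)

request = np.ones((13, 1))               # one sample we want a prediction for
print(pad_to_batch(request, 100).shape)  # (100, 13, 1)
```

After predicting on the padded batch, only the first row of the result corresponds to our real request.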


To make a stateful network, we need to do four things. 


First, we need to pass the optional argument stateful to each RNN
(such as an LSTM or GRU) and set it to True. This tells Keras that
we’re taking care of when to reset the cell’s state.


Second, we need to include the argument batch_size to the first RNN 
we make, and set it to the batch size that we’re going to use during 
training. 


Third, when we call fit() we need to set shuffle to False. 


Finally, when we want to reset the state, we need to explicitly call 
reset_states() on our model. 


Listing 24.69 shows an example of the first two points while building a
small RNN with two LSTM layers and a one-neuron Dense layer at the
end. This is adapted from the stateful_lstm.py example in the Keras
documentation [Chollet17b]. We assume that the variable time_steps
holds the number of time steps in our inputs, and batch_size is set to
the batch size we plan to use.


model = Sequential()
model.add(LSTM(50,
               input_shape=(time_steps, 1),
               return_sequences=True,
               batch_size=batch_size, stateful=True))
model.add(LSTM(50, stateful=True))
model.add(Dense(1))


Listing 24.69: Building a stateful RNN. We need to give the first LSTM a 
value for batch_size, and set stateful=True in each LSTM. Adapted 
from [Chollet17b]. 


To train our model, we need to remember to tell fit() not to shuffle 
our data. Because we want to reset the state after each epoch, we don’t 
want to do the usual thing of telling fit() how many epochs to train 
for, and then walking away, because then the RNN will never be reset. 


Instead, we'll tell fit() to train for only 1 epoch, and we'll put that call 
in a loop. The loop will repeat for the number of epochs we want to 
train for. Doing it this way lets us put in a call to reset_states() at 
the end of each epoch of training. 


Listing 24.70 shows this step. 


for i in range(number_of_epochs):
    model.fit(X_train, y_train, batch_size=batch_size,
              epochs=1, verbose=1, shuffle=False)
    model.reset_states()


Listing 24.70: Training a stateful RNN. We need to tell fit() not to 
shuffle the data, and then we need to call reset_states() after each 
epoch (adapted from [Chollet17b]). 


Chapter 24: Keras Part 2 


24.5.10 Time-Distributed Layers 


As we've seen, when we set an RNN unit’s return_sequences argu- 
ment to True, Keras saves its output after each time step. 


We saw in Figure 24.84 that this results in one output for each time 
step in the sample. The outputs for a sample get gathered together 
into a grid, and the grids for many samples get gathered together into 
a volume. 


Let’s suppose we want to process this output volume in a Dense, or ful- 
ly-connected layer. We’d need to flatten it first into a 1D list, and then 
feed that list to the layer. Sometimes that’s fine. 


But other times, we’d like the dense layer to process the individual 
outputs one by one. So we want to run the dense layer over each of the 
separate lists coming out of the RNN in Figure 24.84 rather than over 
the 2D grid they get assembled into. 


Once these lists have been assembled into a grid, it would be hard to 
pull them apart. We could write our own custom layer, or make a cus- 
tom contraption using the Functional API (discussed below), but we’d 
like an easier approach that lets us treat these individual outputs one 
by one. 


Keras provides a special-purpose layer for exactly this job. It’s called 
a TimeDistributed layer. It’s not really a layer, though. Keras calls it 
a wrapper layer. The idea is that it’s a container that we put one or 
more layers into, and then those layers get treated in a special way. 


To get a feeling for what the TimeDistributed wrapper does for us, 
let’s build a tiny network without one. Figure 24.87 shows an RNN 
with 4 elements of state, followed by a Dense layer with 5 neurons. 
Since there’s no box on top of the RNN icon, we know that it doesn’t 
return a sequence. 


Figure 24.87: A small network to set up our discussion of the
TimeDistributed layer.


As we can see, the input is 1 sample, made of 1 feature, with 5 time 
steps. After passing through the RNN, we get a single 4-element list. 
We can then feed that directly into a Dense layer of 5 neurons, getting 
back a list of 5 values at the output. 


Let’s have the RNN return the individual sequence outputs by setting 
return_sequences to True. Now we have Figure 24.88(a), where we 
had to insert a Flatten layer between the RNN and the dense layer. 


Figure 24.88: When an RNN returns a sequence, we can apply some other
layer to each step of the sequence by wrapping it in a TimeDistributed
layer. (a) The RNN’s 3D output does not fit the Dense layer’s need for a 1D
list, so we can flatten it first. But then the Dense layer is processing all 20
outputs at once. (b) If we wrap the Dense layer in a TimeDistributed
layer, Keras will hand it each time step of the output in turn, and then
combine the results again.


Although flattening the RNN output as in Figure 24.88(a) works, and
everything will run, it’s not what we want. The problem is that the
Dense layer will process all 20 values coming out of the RNN at once.
What we want is for it to process each time step individually.


If we wrap the Dense layer in a TimeDistributed wrapper, we invoke 
a bunch of machinery inside of Keras that gives us an operation shown 
in Figure 24.88(b). Each time step is individually handed to the Dense 
layer and then the results are combined. Note that in this figure there 
is only one Dense layer which gets applied to all 5 time steps. 
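

The difference between the two arrangements shows up clearly in the shapes and the weight counts. Here is a minimal NumPy sketch of both; it uses random numbers in place of real RNN output and leaves out the activation function, and the variable names are our own. It illustrates the idea rather than how Keras implements the wrapper.

```python
import numpy as np

rng = np.random.default_rng(0)
rnn_out = rng.normal(size=(1, 5, 4))  # 1 sample, 5 time steps, 4 state elements

# (a) Flatten, then a 5-neuron dense layer: all 20 values are seen at once.
w_flat = rng.normal(size=(20, 5))
b_flat = np.zeros(5)
flat_result = rnn_out.reshape(1, 20) @ w_flat + b_flat

# (b) Time-distributed 5-neuron dense layer: the same weights at every time step.
w_td = rng.normal(size=(4, 5))
b_td = np.zeros(5)
td_result = np.stack(
    [rnn_out[:, t, :] @ w_td + b_td for t in range(5)], axis=1)

print(flat_result.shape, w_flat.size + b_flat.size)  # (1, 5) 105
print(td_result.shape, w_td.size + b_td.size)        # (1, 5, 5) 25
```

The time-distributed version keeps the time axis in its output and needs far fewer weights, because one small dense layer is shared by all 5 time steps.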


Another version of Figure 24.88(b) is shown in Figure 24.89. On the 
left we show how we'd draw our network schematically. The five-sided 
shape around the Dense layer is our icon for the TimeDistributed 
layer. The V at the bottom is meant to suggest the branching of the 
lines in Figure 24.88(b), telling us that the one input is being broad- 
ened. On the right is an expanded view of what’s happening inside of 
the TimeDistributed layer. Again, there’s just one Dense layer that is 
being applied to each time step in this sample. 


Figure 24.89: By wrapping our Dense layer in a TimeDistributed
layer, the Dense layer will individually process each sequential output
from the RNN, and then those results will be assembled into a new tensor.


To create a layer wrapped in a TimeDistributed layer, we just nest 
the calls, as in Listing 24.71. 


model = Sequential()
model.add(LSTM(4, return_sequences=True,
               input_shape=[window_size, 1]))
model.add(TimeDistributed(Dense(5)))


Listing 24.71: Feeding the output of an LSTM layer to a Dense layer, 
wrapped up in a TimeDistributed wrapper, as in Figure 24.89. 


Figure 24.90 shows our icon for a TimeDistributed layer with
different contents. With the Functional API (discussed below) we can wrap
many layers with just one TimeDistributed wrapper. Alternatively,
we can wrap each one individually.


Figure 24.90: Three collections of layers, each in a TimeDistributed
wrapper.


24.5.11 Generating Text 


The letter by letter notebook is 
Keras-Notebook-15-Generate-Text-By-Letter.ipynb 


The word by word notebook is 
Keras-Notebook-16-Generate-Text-By-Word.ipynb 


In Chapter 22 we experimented with generating new text based on the 
stories of Sherlock Holmes. 


This isn’t hard to do, but it requires more than just a couple of lines 
of Python programming. The notebooks for this section contain all 
the code for making new text, either letter by letter or word by word. 
Rather than go through all the details, we'll just walk through the big 
pieces and mention some highlights. Our code is influenced by a popu- 
lar presentation available online [Karpathy15]. 


In previous notebooks in this chapter we’ve presented the code as
essentially a single big list of lines to be executed in order. We broke
up the lines into conceptual chunks and placed them in cells, but that
didn’t change how we wrote or used the code. For variation, this time
we packaged up each of the steps into its own procedure. Then when
we’re ready to make text, we just call some of those procedures and let
them do their work.


Our first step is to read in the source text. We replaced multiple spaces 
with single spaces, and removed newline characters since they don’t 
have any semantic meaning. 


This code is shown in Listing 24.72, where we’ve wrapped up the job of 
reading and processing the file into a routine called get_text(). 


from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import LSTM
from keras.optimizers import RMSprop
import numpy as np
import random
import sys

def get_text(input_file):
    # open the input file and do minor processing
    file = open(input_file, 'r')
    text = file.read()
    file.close()
    #text = text.lower()
    # replace newlines with blanks, double blanks with singles
    text = text.replace('\n', ' ')
    text = text.replace('  ', ' ')
    print('corpus length:', len(text))
    return text


Listing 24.72: To generate new text, we start by reading in and processing 
the text file with our source text. 


Now we have to chop up the input into overlapping windows. We
need to pick the window size and how much they overlap. The routine
build_fragments() creates these little fragments for us, as in Listing
24.73.


def build_fragments(text, window_length):
    # make overlapping fragments of window_length characters
    # (window_step, the stride between fragments, is read from a global)
    fragments = []
    targets = []
    for i in range(0, len(text)-window_length, window_step):
        fragments.append(text[i: i + window_length])
        targets.append(text[i + window_length])
    print('number of fragments of length window_length=',
          window_length, ':', len(fragments))
    return (fragments, targets)


Listing 24.73: We build text fragments by breaking up the input text into 
overlapping pieces, each with window_length characters. 
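

To get a feeling for what comes out, here is the same windowing logic run on a tiny string. This is a sketch that makes one change for the sake of being self-contained: window_step is passed in as an argument rather than read from a global variable. The toy string and the sizes are our own choices.

```python
def build_fragments(text, window_length, window_step):
    """Return overlapping fragments and the character that follows each one."""
    fragments, targets = [], []
    for i in range(0, len(text) - window_length, window_step):
        fragments.append(text[i:i + window_length])
        targets.append(text[i + window_length])
    return fragments, targets

fragments, targets = build_fragments("hello world",
                                     window_length=4, window_step=3)
for fragment, target in zip(fragments, targets):
    print(repr(fragment), '->', repr(target))
# 'hell' -> 'o'
# 'lo w' -> 'o'
# 'worl' -> 'd'
```

With a window of 4 and a step of 3, each fragment shares just 1 character with the one before it. Each target is the letter we want the network to predict from its fragment.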


Since our network wants numbers, not letters, we'll assign a unique
number to each letter. To make it easy to go back and forth, we'll make
two dictionaries. One is keyed on characters and returns their number,
and the other is keyed on number and returns their character. We'll
call the number an “index.” We can get the set of unique
characters by using Python’s set() operation. Just for general
tidiness we'll sort that list before using it. Listing 24.74 shows the routine
build_dictionaries() that does the job.


def build_dictionaries(text):
    unique_chars = sorted(list(set(text)))
    print('total unique chars:', len(unique_chars))
    char_to_index = \
        dict((ch, index) for index, ch in enumerate(unique_chars))
    index_to_char = \
        dict((index, ch) for index, ch in enumerate(unique_chars))
    return (unique_chars, char_to_index, index_to_char)


Listing 24.74: We build a pair of dictionaries to let us turn each letter 
into a unique number, and vice-versa. 
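

Here is a quick round trip through the two dictionaries, using a toy string of our own choosing. This sketch writes the dictionaries with comprehensions and leaves out the print statement, but it builds the same mappings.

```python
def build_dictionaries(text):
    """Map each unique character to an index, and each index back to its character."""
    unique_chars = sorted(set(text))
    char_to_index = {ch: index for index, ch in enumerate(unique_chars)}
    index_to_char = {index: ch for index, ch in enumerate(unique_chars)}
    return unique_chars, char_to_index, index_to_char

unique_chars, char_to_index, index_to_char = build_dictionaries("banana")
encoded = [char_to_index[ch] for ch in "banana"]
decoded = ''.join(index_to_char[index] for index in encoded)

print(unique_chars)  # ['a', 'b', 'n']
print(encoded)       # [1, 0, 2, 0, 2, 0]
print(decoded)       # banana
```

Sorting the characters means the same text always produces the same numbering, which matters if we save a trained model and want to use it again later.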


Now we want to turn our samples and targets into one-hot vectors. 


We're already familiar with using one-hot targets. We'll use one-hot
encoding for the samples here as well, because we want each letter to
be one time step in our data. That time step will have as many features as
there are unique characters in our data. They'll all be 0 except for a 1
corresponding to the character being represented.


Listing 24.75 shows one way to build the one-hot versions. We'll make 
a couple of grids full of zeroes, and then set the ones where needed. 


def encode_training_data(fragments, window_length, targets,
                         char_to_index, index_to_char):
    # Turn inputs and targets into one-hot versions
    # (the built-in bool replaces np.bool, which newer NumPy versions reject)
    X = np.zeros((len(fragments), window_length,
                  len(char_to_index)), dtype=bool)
    y = np.zeros((len(fragments), len(char_to_index)),
                 dtype=bool)
    for i, fragment in enumerate(fragments):
        for t, char in enumerate(fragment):
            X[i, t, char_to_index[char]] = 1
        y[i, char_to_index[targets[i]]] = 1
    return (X, y)


Listing 24.75: Turning our fragments and targets into one-hot versions 
called X and y. 
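

A tiny example makes the shapes concrete. The three-letter alphabet and the two fragments below are made up for illustration; the loop is the same one used in the listing.

```python
import numpy as np

char_to_index = {'a': 0, 'b': 1, 'n': 2}
fragments = ['ban', 'ana']  # two windows of 3 characters each
targets = ['a', 'n']        # the character that follows each window

X = np.zeros((len(fragments), 3, len(char_to_index)), dtype=bool)
y = np.zeros((len(fragments), len(char_to_index)), dtype=bool)
for i, fragment in enumerate(fragments):
    for t, char in enumerate(fragment):
        X[i, t, char_to_index[char]] = True
    y[i, char_to_index[targets[i]]] = True

print(X.shape, y.shape)  # (2, 3, 3) (2, 3)
print(X[0].astype(int))  # one row per time step: b, then a, then n
```

X has one grid per fragment, with a row for each time step (letter position) and a column for each unique character. Every row contains exactly one 1.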


Now let’s build the model. After a little playing around, we chose the
simple deep model of Figure 24.91. It’s just two LSTM layers and a
single Dense layer. The first LSTM has return_sequences=True, because
it feeds another LSTM. The second one produces a single output, which
will lead us to the letter the network is predicting. To get that letter,
we use a Dense layer with one neuron per letter, and a softmax output.
This will give us the probability of each character being the next one.


Figure 24.91: Our simple deep network for generating text one letter at
a time.


Listing 24.76 gives the source for building this model. 


def build_model(window_length, num_unique_chars):
    # build the model. Two layers of a single LSTM cell with 128
    # elements of memory, then a dense layer with as many outputs
    # as there are characters.
    # We'll train with the RMSprop optimizer. Some experiments
    # suggest starting with a learning rate of 0.01
    model = Sequential()
    model.add(LSTM(128, return_sequences=True,
                   input_shape=(window_length, num_unique_chars)))
    model.add(LSTM(128))
    model.add(Dense(num_unique_chars, activation='softmax'))
    # newer Keras releases name this argument learning_rate
    optimizer = RMSprop(lr=0.01)
    model.compile(loss='categorical_crossentropy',
                  optimizer=optimizer)
    return model


Listing 24.76: Build our little RNN architecture. 


Now we're ready to generate text. We'll call a new routine called 
generate_text() that will train the model for a single epoch, and 
then print out some text that it generates. This way we can see how the 
quality of the text improves over time. 


After each call to fit() to train the model, we'll pick a random start- 
ing point in the original document and extract characters from there. 
We'll pick as many characters as in the window size we trained on. 
We'll one-hot encode that sequence of characters and give the result 
to predict(). This will give us back one probability for each unique
character in the original text, telling us how likely it is that that char- 
acter is the one that comes next after the input text. 


We could just use the most probable character, but in practice that 
tends to give us a lot of repeated words. A nice alternative is to juggle 
around the probabilities a little so that less-likely letters also have a 
chance of being chosen. A nice algorithm for that adds metaphorical 
“heat” to the probabilities to change their values [Chollet17c]. We’ve 
wrapped that up in a routine called choose_probability() that’s in 
the notebook. 
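

One common form of this idea can be sketched in a few lines of NumPy. We take the logarithm of each probability, divide by the temperature, and then turn the results back into probabilities that sum to 1. To be clear, this is our own illustration with hypothetical names reweight() and sample_index(); the choose_probability() routine in the notebook may differ in its details.

```python
import numpy as np

def reweight(probabilities, temperature):
    """Sharpen (temperature < 1) or flatten (temperature > 1) a distribution."""
    logits = np.log(np.asarray(probabilities, dtype=float)) / temperature
    scaled = np.exp(logits - logits.max())  # subtract the max for stability
    return scaled / scaled.sum()

def sample_index(probabilities, temperature, rng):
    """Draw one index at random from the reweighted distribution."""
    adjusted = reweight(probabilities, temperature)
    return int(rng.choice(len(adjusted), p=adjusted))

probabilities = [0.5, 0.3, 0.2]
for temperature in (0.5, 1.0, 1.5):
    print(temperature, np.round(reweight(probabilities, temperature), 3))

rng = np.random.default_rng(0)
print(sample_index(probabilities, 1.0, rng))
```

A temperature of 1.0 leaves the probabilities unchanged. Lower temperatures push more of the weight onto the most likely character, and higher temperatures spread the weight out so unlikely characters get picked more often.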


Once we've got the prediction for the next character, we append that 
prediction to a growing output string. Then we append the new char- 
acter to the end of our input to the model, while also dropping the 
first character from that string, so the input is always the length of the 
training windows. Then we train the system for another epoch and do 
it all again. 


The code for generate_text() is shown in Listing 24.77. Rather
than simply printing strings to the output, we hand them to a routine
named print_string() that both prints them, and saves them in a
file that we’ve opened.


def generate_text(model, X, y, number_of_epochs, temperatures,
                  index_to_char, char_to_index, file_writer):
    # train the model, output generated text after each iteration
    for iteration in range(number_of_epochs):
        print_string('--------------------------------------------------\n',
                     file_writer)
        print_string('Iteration '+str(iteration)+'\n',
                     file_writer)
        history = model.fit(X, y, batch_size=batch_size, epochs=1)
        start_index = random.randint(0, len(text)-window_length-1)
        for temperature in temperatures:
            print_string('\n----- temperature: '+\
                         str(temperature)+'\n',
                         file_writer)
            seed = text[start_index: start_index+window_length]
            generated = seed
            print_string('----- Generating with seed: '+\
                         '<'+seed+'>\n',
                         file_writer)
            for i in range(generated_text_length):
                x = np.zeros((1, window_length,
                              len(index_to_char)))
                for t, char in enumerate(seed):
                    x[0, t, char_to_index[char]] = 1.
                preds = model.predict(x, verbose=0)[0]
                next_index = choose_probability(preds,
                                                temperature)
                next_char = index_to_char[next_index]
                generated += next_char
                seed = seed[1:] + next_char
            print_string(generated+'\n\n', file_writer)
            file_writer.flush()


Listing 24.77: Generate new text using our trained model. 


The majority of the work in this program involves messing about with 
the data, making the dictionaries and windows and doing the one-hot 
encoding and so on. The actual neural network code was just a few lines 
to make the network, and one line each to train it and get predictions. 


To train the system, we picked a window length of 40 characters, and a 
step of 3, so each training string overlapped 37 characters with the one 
before. We used a batch size of 100, and generated 1000 new char- 
acters after each training step, using “temperatures” of 0.5, 1.0, and 
1.5. The text with temperature 0.5 tended to produce the same words 
frequently, and temperature 1.5 produced mostly words, but also lots 
of strings that weren’t words. It’s fun to play with the temperature to 
find the sweet spot where the output is interesting, with the occasional 
weird almost-word. 
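

To make the window arithmetic concrete, here is a small sketch of how
overlapping training windows might be cut from the text. The function
name make_windows() and its arguments are our own, not the notebook's;
the defaults match the window length of 40 and step of 3 described
above.

```python
def make_windows(text, window_length=40, step=3):
    """Cut text into overlapping windows, each paired with the next character."""
    windows, targets = [], []
    for start in range(0, len(text) - window_length, step):
        windows.append(text[start:start + window_length])
        targets.append(text[start + window_length])  # the character to predict
    return windows, targets

sample_text = "the quick brown fox jumps over the lazy dog. " * 4
windows, targets = make_windows(sample_text)

# With a step of 3, the last 37 characters of one window
# are the first 37 characters of the next.
shared = windows[0][3:]
```

Each window is paired with the single character that follows it, which
is the value the network is asked to predict.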


As we mentioned in Chapter 22, this can take a long time to run. On a 
late 2014 iMac, without GPU support, each iteration takes about 1400 
seconds, or a little more than 23 minutes. Networks like this often 
take 800 epochs or so to start producing text that is close to the source. 
That would be about 13 days of 24/7 crunching. So we ran this network 
for 100 epochs on Amazon Web Services, watching the loss drop from 
about 2.6 to 1.1. Here’s the start of what it generated after that much 
training, starting with the seed “last time in my life. Certainly a gray 
m”, and with a temperature of 1.0. 


last time in my life. Certainly a gray myself under the great 
tautoh; harm I should be a busy because cameful allo done.” “Why 
dud that you dedy hour any one of these chimnes of this pricaption 
is to his, If the tall. Up appeared to very set over with Mr. 
Trem, there, if we confeeliin, I fawny of days if so far 


Clearly we have a long way to go. But remember that this is letter by 
letter, from a system that has no idea of English or language or any 
such structure. Given that it started from nothing and had only such a 
small amount of training, this is pretty great. 


An alternative way to generate text that we discussed in Chapter 22 
is to focus on sequences of words, rather than letters. This is appeal- 
ing in many ways, but it’s also slower to train. If we have 7000 or 
8000 unique words, that’s a lot more work to manage than 89 unique 
characters. 


We experimented with this a bit and chose the architecture in Figure 
24.92 to generate new text word by word. The full code is provided in 
the accompanying notebook. 


[Figure 24.92 diagram labels: 8, 0.3, 8, 0.3, 8000 softmax.] 


Figure 24.92: A network for word-by-word text generation. 


We trained with the 8000 most frequently-used words in the text, 
replacing all others with the marker GLORP. Here’s the output after 
the first epoch, starting with the seed, “tell her future husband the 
whole story and to trust to his generosity .” Milverton chuckled . “You 
evidently do not know the Earl , ” said he . From the baffled look upon 
Holmes’s face , I could”. Note that the punctuation marks have been 
isolated as their own words. 


tell her future husband the whole story and to trust to his 
GLORP .” Milverton chuckled . “You evidently do not know the 
Earl,” said he. From the baffled look upon Holmes’s face, I could 
each clear at screen At there by put got His you openly is do that 
were once Your plans from my He greatest life to did mantle it 
first India” drive as come really It black build my is put hearty 
Stanley sprang , afraid once quite whom had comes sole snuff 
Francisco 


Training on an Amazon Web Services GPU-enabled p2.xlarge instance 
took 15 minutes per epoch. Over the first 10 epochs, the training loss 
dropped from about 6.8 to about 5.3. But with so many thousands of 
words to choose from, things didn’t get much better. 


Here’s the output from epoch 10, starting with the seed, “it would be a 
grief to me to be forced to take any extreme measure . You smile, sir , 
but I assure you that it really would .’ GLORP is part of my trade ,’ I’. 


it would be a grief to me to be forced to take any extreme 
measure . You smile , sir, but I assure you that it really would .’ 
GLORP is part of my trade ,’ I door small little who very lamps 
into dropped imagine the the GLORP,. his that the would nose 
, tell. Smith said was the The and is . a know to would are none 
very had there was are It a Mother upon away my for - and the 
about are not the for to I open one, it far ? 


We'd have to spend a lot of time training this model before the results 
got interesting. The notebooks for this section provide the source for 
generating new text, either letter by letter or word by word, for those 
with the compute power and patience to dig in. 


24.6 The Functional API 


So far in this chapter we’ve built our models by placing one layer after 
another. The Sequential API that we’ve been using was designed for 
just this sort of architecture. Keras offers a second way to build our 
models, called the Functional API. 


The reason for a second API is that sometimes we want to build mod- 
els that are not strictly sequential. For example, in Chapter 25 we'll 
build a model called a variational autoencoder. It starts with a 
sequence of layers, but then it splits into two, with two different layers 
getting input from the same predecessor, as in Figure 24.93. Then we 
combine those two layers back into one and continue with a sequential 
model. 


Figure 24.93: This model cannot be made with the Sequential API. We 
need to use the Functional API to build this. 


We can’t make a network like Figure 24.93 with the Sequential API, 
because it assumes that each layer gets input from, and sends output 
to, no more than one layer. 


Using the Functional API, creating layers and connecting them are two 
separate operations. We can first make whatever layers we need, and 
then connect them together however we like. 


The Functional API is powerful, but it can also be complex and subtle. 
Here we'll stick to just the basics that will let us make a model like 
Figure 24.93. If we just want to make a straight chain of layers, like 
the one the Sequential API builds, we can make that with the Functional 
API as well. 


The key thing to know about this approach is that each layer is its own 
object. That is, we create it and assign it to a variable. Once we create a 
layer, it contains its own weights, parameters, and internal processing. 
We will then connect that layer to other layers to build the model. 


But because each layer is an object, we can use it more than once. Let’s 
look at an imaginary model that we might use for image classification. 
We've decided that the first two layers will be fully-connected, or Dense, 
layers of 100 and 200 neurons, as in Figure 24.94(a). 


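[Figure 24.94 diagram. (a) Dense layers of 100 and 200 neurons (ReLU), 
followed by a 1-neuron linear layer. (b) The same two Dense layers, 
followed by a convolution layer labeled 16 x (5x5).] 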


Figure 24.94: Re-using layers in Keras. (a) A small network using two 
hidden Dense layers, shown in red. (b) A different network that re-uses 
the two red layers from part (a) as the first stages. We can build both 
models and use one and then the other. The layers in red are not copied, 
but shared, so any learning from either network is also used in the other. 


Let’s suppose that later on we want to build a different model for 
roughly the same task. But this time we want to put a convolution layer 
at the end, as in Figure 24.94(b). We can re-use the first two layers 
from our first model. These aren’t copies, but the same layers, just 
taking their place in this new network. They retain all the weights 
that they learned when they were part of the other network. So as we 
train either model, the shared layers in the other model are trained 
as well. 


It may help to think of the layers, the connections, and the models 
as three different ideas. We start with a “soup” of layers, all floating 
around and not connected to anything, as in Figure 24.95(a). Then 
we decide to build some connections from one layer to the next, as 
in Figure 24.95(b). That set of connections is a model, as in Figure 
24.95(c). Later we might build a second set of connections, making 
a second model. We’re re-using the same layers in each model, just 
changing where their inputs come from and where the outputs go. 
When any layer learns, that change will be incorporated into any other 
model that layer is used in. 
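

The sharing idea doesn't depend on anything specific to Keras, so it
can be illustrated with plain Python objects. This toy sketch is not a
neural network: ToyLayer and its single weight are made up for
illustration, and "training" is just adding a number. It shows only
that two models built from the same layer objects see each other's
changes.

```python
class ToyLayer:
    """A stand-in for a layer: one weight, and a call that scales its input."""
    def __init__(self, weight):
        self.weight = weight

    def __call__(self, x):
        return self.weight * x

def run(model, x):
    """A 'model' here is just an ordered list of layers to apply."""
    for layer in model:
        x = layer(x)
    return x

# The "soup" of layers, not yet connected to anything.
a, b, c, d = ToyLayer(2.0), ToyLayer(3.0), ToyLayer(5.0), ToyLayer(7.0)

model_1 = [a, b, c]   # one set of connections
model_2 = [a, b, d]   # a second set that re-uses layers a and b

before = run(model_2, 1.0)   # 2 * 3 * 7
a.weight += 1.0              # pretend that training model_1 changed layer a
after = run(model_2, 1.0)    # 3 * 3 * 7: model_2 sees the change
```

Because model_1 and model_2 hold references to the same objects rather
than copies, a change made through one is immediately visible in the
other.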


Figure 24.95: In the Functional API, we distinguish the layers from how 
they’re connected. Left: A “soup” of layers. Middle: Connections between 
layers make a model (Model 1). Right: A different set of connections 
between the layers at the far left that composes another model (Model 2). 
Any training that happens in any model stays with the layers that learned 
it, so the other model will benefit from those improved weights. 


This flexibility in connections and re-use of layers allows us to perform 
a useful operation called pre-training. In this technique, we teach a 
piece of a network in isolation before we teach the whole thing. The 
idea is to build a small network with just the layers we want to train, 
and teach that for a while. This is a useful technique when we 
anticipate that one piece of the network is going to take much more 
time to train than the rest of it. We can teach the difficult part 
first, in isolation, which will often be much faster than training the 
whole thing. 
When those layers have become good at their jobs, we then connect 
them to the larger model and train up the whole thing. 


This idea generalizes even to the level of the model. We can create a 
model, and then make a second model that includes the first one. We 
don’t even need to explicitly include all the layers. Keras allows us to 
place one model into another by name, just as if it was a layer. 


24.6.1 Input Layers 


The Functional API supports all the layers offered by the Sequential 
API, including those we’ve seen above. But it requires one additional 
layer to act as the input layer. 


Recall that the input layer is usually implicit, since it’s just a place to 
“park” the incoming data. When we make a sequential model, we tell 
Keras the size of the input layer with the input_shape argument on 
the first layer, and Keras makes an input layer for us that’s the right 
size to hold one sample of that shape. 


In the functional API, it’s our job to create that layer explicitly and add 
it into the model. 


The new layer is called an Input layer, and it takes one argument called 
shape that tells it the structure of the input. This is identical in use to 
input_shape, so it’s unfortunate that they have different names. 


Let’s think back to the MNIST data set. To create an input for a piece 
of flattened MNIST data containing a list of 784 elements, we could 
write the code in Listing 24.78. 


    input_layer = Input(shape=[784])


Listing 24.78: Creating an input layer for 784 elements. 


A common alternative is to pass the shape as a one-element Python 
tuple, written (784,), where the comma tells the system that this isn’t 
just the number 784 in parentheses, but a tuple with one element. 
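

The role of the trailing comma is easy to check in plain Python. In
this sketch the variable names are ours. Strictly speaking, parentheses
with a trailing comma make a one-element tuple, while square brackets
make a one-element list; the listings in this section pass both forms
as the shape.

```python
just_a_number = (784)    # parentheses alone only group: this is the int 784
one_tuple = (784,)       # the trailing comma makes a one-element tuple
one_list = [784]         # square brackets make a one-element list

kinds = [type(v).__name__ for v in (just_a_number, one_tuple, one_list)]
```

Here kinds holds the type names of the three values in order, so the
difference between the first two forms shows up immediately.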


Let’s suppose that our first hidden layer, immediately following the 
Input layer, is a convolution layer, which takes as input a tensor of 
shape 28 by 28 by 1. We can create the input layer with Listing 24.79. 


    input_layer = Input(shape=[28, 28, 1])


Listing 24.79: Creating an input layer for a tensor of 28 by 28 by 1 
elements. 


Now that we have our input layer, we can move on to making a model. 


24.6.2 Making A Functional Model 


To build a model in the Functional API, we create each layer as its own 
object, and then add it to the model by identifying where it gets its 
input from. 


Our first job is to make a layer, which we save in a variable. To use it in 
a model, we need to specify its inputs. We don’t need to specify where 
the output goes, because the system can figure that out from the other 
layers. Let’s say our current layer is named layer_1, and later we cre- 
ate a layer named layer_2 that says it gets input from layer_1. Then 
it’s easy to work out that layer_2 is one of the outputs of layer_1. If 
we like, multiple layers can take their input from layer_1. 


Let’s imagine a simple model that takes a flattened list of 784 values 
as input, and returns a single number giving the probability that the 
image is an MNIST-style digit. We’re not categorizing the inputs here, 
but rather just presenting a single value at the output that tells us if 
the input is or is not an MNIST digit. Let’s propose trying the network 
of Figure 24.96 for the job. We could easily make this model with the 
Sequential API, but let’s use the Functional API to see how it’s done. 


[Figure 24.96 diagram: 784 inputs feed a Dense layer of 1000 neurons 
(ReLU), then a Dense layer of 500 neurons (ReLU), then a Dense layer 
of 1 neuron (sigmoid).] 


Figure 24.96: A simple network to tell us if an input is an MNIST-style 
digit image. 


To create this model, we need an Input layer and three Dense layers. 
Listing 24.80 makes the layers and saves each in a variable. We'll call 
these “unconnected layers”, since they’re not connected to anything. 


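    from keras.layers import Input, Dense

    # note: the name "input" shadows Python's built-in input() function
    input = Input(shape=(784,))
    dense_1 = Dense(1000, activation='relu')
    dense_2 = Dense(500, activation='relu')
    output = Dense(1, activation='sigmoid')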


Listing 24.80: Creating the layers for our network of Figure 24.96. 


Now we need to connect these layers. We'll create a new object called 
a “connection layer.” Like some of the layers we’ve seen before, this 
is really just a wrapper or container. A connection layer points to two 
objects: an unconnected layer, and another connection layer. This lets 
us build up a chain of connection layers that define a whole network. 


Let’s see how this works. On the right of Figure 24.97 we see the four 
unconnected layers we just made in Listing 24.80. Let’s start building 
connection layers at the bottom, with the input layer. The input layer 
is a special case that doesn’t take input from any other layer. As we 
said, the connection layer points to two objects. First there’s the layer 
it’s referring to, which in this case is the input layer. The other object 
is the connection layer that provides the input to this layer. Since this 
layer doesn’t take input, we leave that pointer empty. We just built our 
first connection layer. We'll call this C1_input, where C refers to this 
being a connection layer, and the 1 distinguishes it from other such 
layers we'll make later. 


[Figure 24.97 diagram: the connection layers C1_input, C1_dense_1, 
C1_dense_2, and C1_output.] 


Figure 24.97: Building connection layers. On the right are the original, 
unconnected layers. On the left are the connection layers. Each connec- 
tion layer points to an unconnected layer, and to the preceding connec- 
tion layer. The connection layer for the input layer is a special case. 


Let’s move up now to the first hidden layer, which we called dense_1. 
To build its connection layer, we point its first value at the unconnected 
layer dense_1. Then we point it at the connection layer that provides 
dense_1 with input. This is C1_input, which we just made. That’s it 
for this connection layer, which we'll call C1_dense_1. 


Moving upwards, we repeat the process for the next two layers. 


This chain of layers is everything we need to build a model. When 
Keras writes the code to route the output of each layer to some other 
layer, it just follows the chain of connection layers. 


The key thing is that the connection layers aren’t duplicating the 
unconnected layers. If we make another connection layer that points 
to, say, dense_2, then we’re not modifying dense_2 in any way. 


Listing 24.81 shows the code for Figure 24.97. We make our connection 
layers implicitly by treating each unconnected layer like a function, 
and giving it the argument of the connection layer it points to. The 
syntax can seem a little weird. Keras makes it easy to make the con- 
nection layer C1_input, which only needs to point to the unconnected 
input layer. 


    # The input layer has no previous connections
    C1_input = input
    C1_dense_1 = dense_1(C1_input)
    C1_dense_2 = dense_2(C1_dense_1)
    C1_output = output(C1_dense_2)


Listing 24.81: Building our connected layers, combining a layer with its 
connection to a previous layer. Only the input layer does not have an 
input. 


Now we have four new variables, each prefixed with C1_. Each one tells 
us about a layer and where it gets its input from. We can make a model 
out of these connection layers by calling Model() with the input and 
output connection layers as arguments, as in Listing 24.82. 


    network_1 = Model(C1_input, C1_output)


Listing 24.82: Making our model based on the input and output connec- 
tion layers. The other layers are included implicitly. 
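

Listings 24.80 through 24.82 can be gathered into one short script.
This is a sketch rather than the notebook's code: it assumes Keras is
installed and importable as keras, renames input to inputs to avoid
shadowing Python's built-in, and trains for one epoch on random
placeholder data (X_fake and y_fake are made up) simply to confirm
that the pieces connect. The optimizer and loss are our own choices.

```python
import numpy as np
from keras.layers import Input, Dense
from keras.models import Model

# Unconnected layers (Listing 24.80)
inputs = Input(shape=(784,))
dense_1 = Dense(1000, activation='relu')
dense_2 = Dense(500, activation='relu')
output = Dense(1, activation='sigmoid')

# Connection layers (Listing 24.81): call each layer on its predecessor
C1_input = inputs
C1_dense_1 = dense_1(C1_input)
C1_dense_2 = dense_2(C1_dense_1)
C1_output = output(C1_dense_2)

# The model (Listing 24.82)
network_1 = Model(C1_input, C1_output)
network_1.compile(optimizer='adam', loss='binary_crossentropy')

# Random stand-in data, used only to check that training runs
X_fake = np.random.rand(64, 784).astype('float32')
y_fake = np.random.randint(0, 2, size=(64, 1)).astype('float32')
network_1.fit(X_fake, y_fake, batch_size=16, epochs=1, verbose=0)

predictions = network_1.predict(X_fake[:4], verbose=0)
```

The predictions array has one row per input sample and a single column
holding the sigmoid output.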


Notice that we don’t have to specify all the connection layers between 
the input and the output. Keras can figure out that they’re needed 
because it can follow the chain of connected layers backwards from 
the output layer to the input layer. 


Now that we have our model, we can treat it just like the sequential 
models we saw above. So we'll call compile() to actually build the 
model and then fit() to train it. 


Let’s say that later we want to play around with this architecture a 
little. Maybe we’d improve performance by replacing the next-to-final 
stage of processing in the dense_2 layer with a convolution layer fol- 
lowed by a flatten layer. We don’t want to train from scratch, since our 
first dense layer and our output layer (also a dense layer) are the same. 
We'd like to make a second model out of the pieces of the first, but we 
don’t want to disassemble the first model. 


This is easy with the Functional API. We start by making a couple of 
new unconnected layers (our convolution and flatten layers). Now we 
build up a new set of connection layers, as shown in Figure 24.98. The 
first two layers, and the last, will re-use the layers from our first model. 
All of their learned weights come along for the ride. 


[Figure 24.98 diagram: the unconnected layers (input, dense_1, dense_2, 
flatten_1, output) alongside the new connection layers C2_input, 
C2_dense_1, C2_convo_1, C2_flatten_1, and C2_output.] 


Figure 24.98: Building connections. 


We can even flip back and forth, training the first model for a little 
while, then the second, then back to the first. 


The code for this new model is shown in Listing 24.83. 


    # Schematic: Conv2D expects image-shaped input (rows, columns,
    # channels), so runnable code would need a Reshape layer between
    # dense_1 and convo_1.

    # define the new layers
    convo_1 = Conv2D(32, (5,5))
    flatten_1 = Flatten()

    # build the new connection layers
    C2_input = input
    C2_dense_1 = dense_1(C2_input)
    C2_convo_1 = convo_1(C2_dense_1)
    C2_flatten_1 = flatten_1(C2_convo_1)
    C2_output = output(C2_flatten_1)

    # build the model
    model2 = Model(C2_input, C2_output)


Listing 24.83: Building our new model using some layers from the first 
model. 


Sometimes we don’t want to change the shared layers. For exam- 
ple, we might find that dense_1 changes considerably depending on 
whether we're training the first or second model. Perhaps we want to 
keep all the layers in the first model where they are, but just train the 
two new layers in our second model. In that case we can use the freez- 
ing mechanism to prevent any layer from changing. We just set any 
layer’s optional parameter trainable to False to freeze it, and then 
set it to True if we want to make it trainable again later. 
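

Freezing can be tried on a very small model. The sketch below is our
own illustration, not code from the notebook: the layer sizes and the
random data are invented, and it assumes Keras is importable as keras.
It sets trainable to False on one layer before compiling, trains
briefly, and then checks whether that layer's weights moved.

```python
import numpy as np
from keras.layers import Input, Dense
from keras.models import Model

inputs = Input(shape=(784,))
dense_1 = Dense(16, activation='relu')
output = Dense(1, activation='sigmoid')
model = Model(inputs, output(dense_1(inputs)))

# Freeze dense_1. Set this before compile() so that training respects it.
dense_1.trainable = False
model.compile(optimizer='sgd', loss='binary_crossentropy')

# Remember the frozen layer's weights, then train on random stand-in data.
before = [w.copy() for w in dense_1.get_weights()]
X_fake = np.random.rand(32, 784).astype('float32')
y_fake = np.random.randint(0, 2, size=(32, 1)).astype('float32')
model.fit(X_fake, y_fake, batch_size=8, epochs=1, verbose=0)

after = dense_1.get_weights()
frozen_unchanged = all(np.array_equal(b, a) for b, a in zip(before, after))
```

Only the output layer's weights and biases remain in the model's list
of trainable weights. Setting dense_1.trainable back to True, and
compiling again, would make the layer learn once more.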


We started this section by considering the difficulty of using the 
Sequential API to build a branching architecture such as in Figure 
24.93. Figure 24.99(a) shows a version of such a model. 


Figure 24.99: A branching architecture using the Functional API. (a) The 
architecture we'd like to make. (b) A set of connection layers to represent 
it. 


Figure 24.99(b) shows the structure of a set of connection layers that 
can do the job. Note that the connection layers for both convolution 
layers get their input from the same connection layer, associated with 
dense_2. This kind of branching doesn’t require any special effort 
when using the Functional API, since we just point our connection lay- 
ers where we want them to go. 


Figure 24.99(b) has a layer called add that we haven’t discussed. Keras 
offers us a variety of layers that can combine multiple layers. They’re 
called “Merge Layers,” and the Keras documentation lists a half-dozen 
possibilities for us [Chollet17a]. Each of these layers takes a list of other 
layers in the model, and combines their outputs. We can also write our 
own custom layers if we need something that’s not already available. 
In this case, the layer adds together the output tensors of the two con- 
volution layers. Of course, they must be the same size for this to make 
sense. 


We’ve seen that the Functional API lets us re-use layers in multiple 
models, and build models whose connections are not simply a single 
stack of layers. 
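

As a rough illustration of branching and merging, here is one way a
structure in the spirit of Figure 24.99 might be written. It is a
sketch under several assumptions of ours: the layer sizes are invented,
a Reshape layer (not mentioned above) turns the Dense output into an
image-shaped tensor so the convolutions can accept it, and both
convolutions use padding='same' with the same number of filters so
that their outputs have matching shapes for the Add layer.

```python
import numpy as np
from keras.layers import Input, Dense, Reshape, Conv2D, Add, Flatten
from keras.models import Model

inputs = Input(shape=(784,))
C_dense_1 = Dense(64, activation='relu')(inputs)
C_dense_2 = Dense(784, activation='relu')(C_dense_1)
C_image = Reshape((28, 28, 1))(C_dense_2)

# Two branches that take their input from the same connection layer
C_convo_a = Conv2D(8, (3, 3), padding='same')(C_image)
C_convo_b = Conv2D(8, (5, 5), padding='same')(C_image)

# The merge layer takes a list of tensors and adds them element by element
C_add = Add()([C_convo_a, C_convo_b])

# Back to a single sequential chain
C_flat = Flatten()(C_add)
C_output = Dense(1, activation='sigmoid')(C_flat)

branching_model = Model(inputs, C_output)
result = branching_model.predict(np.zeros((2, 784), dtype='float32'), verbose=0)
```

Nothing special is needed to branch: both convolution layers are
simply called on the same tensor, and the Add layer is called on a
list holding the two results.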


Guidance for building more complex structures can be found online in 
various blogs and GitHub repos, as well as the Keras documentation 
itself. 


Chapter 25: Autoencoders 


An autoencoder learns how to represent its input in a compact form. 
We can make variations of that compressed version to generate new 
data that is like the inputs. 


25.1 Why This Chapter Is Here 


This chapter is about a particular kind of learning architecture called 
an autoencoder. 


One way to think about a standard autoencoder is that it’s a mecha- 
nism for learning how to compress its input into a form that we can 
then decompress to recover a version of the input (usually the decom- 
pressed output is in some way a degraded version of the input). 


In other words, an autoencoder learns how to compress its input so it 
takes up less disk space and can be communicated more quickly, much 
as an MP3 encoder compresses music, or a JPG encoder compresses 
an image. Unlike these highly-tuned algorithms, an autoencoder can 
find a compression scheme for any input. 


The compression method is attuned to the particularities of the train- 
ing set, so it’s not for making general-purpose encoders. That’s why 
JPG and MP3 have nothing to fear from autoencoders. If we trained 
an autoencoder on 3-minute popular music songs, and then tried to 
compress a 40-minute symphony, the results would not be great. 


In practice, we usually use autoencoders for two types of jobs: remov- 
ing the noise from a dataset, and finding a way to automatically reduce 
the dimensionality of a dataset. We can use them for straight compression 
and decompression, and sometimes autoencoders are used that way, 
but we can usually get better results by developing a special-purpose 
algorithm, like JPG or MP3, tailored to the specific type of data we’re 
trying to compress. 


The value of removing noise and reducing dimensionality is that the 
new dataset often results in faster training and better results, com- 
pared to the original dataset. 


A special kind of autoencoder is called the variational autoencoder, 
or VAE. Although still an autoencoder, it works on different principles 
than most “standard” autoencoders, and gives us a nice new feature: if 
we feed random numbers into the second half of a VAE, we can generate 
an unlimited amount of new data that is like, but not a direct 
copy of, the input data. Even better, we can smoothly blend from one 
generated output to another. 


Autoencoders can be applied to any kind of data, such as images, 
sounds, movie frames, weather data, or any other abstract or concrete 
data we’d like to remove noise from, compress for later training, or 
generate more instances of. 


25.2 Introduction 


Compression is a useful tool in many applications. 


One key development that led to the massive popularity of digital music 
is the MP3 encoding standard. This is an algorithm that can compress 
music data significantly, often by 10 times or more over the original 
recorded format while still sounding acceptable [Wikipedia16c]. When 
digital music required burning physical CDs, being able to pack ten 
times as much music on each disk was an enormous advantage. And of 
course it also means more music in the expensive solid-state memory 
found in most portable MP3 players (including tablets and phones). 


A similar story can be told for images, which benefited from the JPG (or 
JPEG) algorithm. In many cases a photograph could be compressed to 
a file with 1/15 or even 1/20 of its original number of bytes, and still 
look good [Wikipedia16a]. This meant images could be included in 
web pages, because their small size enabled them to be communicated 
quickly. This helped lead to the image-rich web that we enjoy today. 


We say that both MP3 and JPG take an input (a music or image file), 
and encode it into a compressed form that takes up less space. Then 
we decode or decompress that intermediate version to recover some 
version of the original. The higher the quality of the compression, the 
more the decompressed version is like the original in ways that matter 
to us for a specific application. 


The MP3 and JPG encoders are entirely different, but they share a 
feature: both are examples of lossy encoding. Let’s see what this means. 


25.2.1 Lossless and Lossy Encoding 


In previous chapters we’ve used the word loss as a near synonym for 
error, so our network’s error function was also called its loss function. 


In this section we'll use the word with a slightly different emphasis, 
referring to the degradation of a piece of data that has been com- 
pressed and then decompressed. The greater the mismatch between 
the original and the decompressed version, the greater the loss. 


The idea of loss, or degradation of the input, is distinct from the idea 
of making the input smaller. For example, in Chapter 6 we saw how 
to use Morse code to carry information. The translation of letters to 
Morse code symbols carries no loss, because no information is lost. 
We say that converting, or encoding, our message into Morse code is a 
lossless transformation, because nothing is lost. We’re just changing 
format, like changing a book’s typeface or type color. 


To see where loss can get involved, let’s suppose that we’re camping in 
the mountains. On a nearby mountain our friend Sara is enjoying her 
birthday. We don’t have radios or phones, but both groups have mir- 
rors, and we've found we can communicate between the mountains by 
reflecting sunlight off our mirrors, sending Morse code back and forth. 


Suppose that we want to send the message, “SARA HAPPY BIRTHDAY 
BEST WISHES FROM DIANA” (for simplicity, we'll leave out 
punctuation). Counting spaces, that’s 42 characters. That’s a lot of 
mirror-wiggling. So we decide to leave out the vowels (counting Y as 
a vowel), and send “SR HPP BRTHD BST WSHS FRM DN” instead. That’s 
only 28 characters, so we can send this in about 2/3 the time of the 
full message. We say the original message has been compressed. 


Our new message has lost some information (the vowels) when it was 
compressed. We can’t reconstruct the original version by just making 
direct substitutions in this message. In other words, there’s less infor- 
mation, in a technical sense, in our vowel-less message than in our 
original. We capture this idea by saying that the compression method 
that leaves out vowels is a lossy method. 
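

The vowel-dropping scheme is simple enough to try in a few lines. In
this sketch the function name drop_vowels() is ours, and we treat Y as
a vowel because the shortened message above drops it from HAPPY and
BIRTHDAY.

```python
def drop_vowels(message, vowels="AEIOUY"):
    """Compress a message by removing every vowel (spaces are kept)."""
    return "".join(ch for ch in message if ch not in vowels)

original = "SARA HAPPY BIRTHDAY BEST WISHES FROM DIANA"
compressed = drop_vowels(original)

original_length = len(original)
compressed_length = len(compressed)
ratio = compressed_length / original_length

# Different names can collapse to the same compressed form.
same_greeting = drop_vowels("SARA") == drop_vowels("SURI")
```

The ratio of the two lengths gives the fraction of the original
sending time we'd need, if every character took the same time to send.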


We can’t make a blanket statement about whether it is or isn’t okay 
to lose some information from any message. If there is loss, then the 
amount of loss we can tolerate depends on the message and all the 
context around it. 


For example, suppose that our friend Sara on the other mountain is 
camping with her friend Suri, and it just happens that they share a 
birthday. In this context, “HPP BRTHD” is probably unambiguous. But 
“SR” can cause confusion, because they can’t tell who we’re addressing. 
And if we’re sharing our mountain with Dan, they won’t know if the 
message was sent by Diana or Dan. 


That’s why context matters. If Sara was camping with Bob and Mary, 
then “SR” would be perfectly clear, and if we’re camping with Howard, 
then “DN” would also be unambiguous. 


So by dropping the vowels we shortened our message, but at the 
expense of losing some of its information. 


An easy way to test if a transformation is lossy or lossless is to consider 
if it can be inverted, or run backwards, to provide us with the origi- 
nal data. In the case of standard Morse code, we can turn our letters 
into dot-dash patterns and then back to letters again with nothing lost 
in the process. But the version without the vowels has lost those let- 
ters forever. We might be able to guess at them and put our guesses 
back in, but we’re only guessing and we could get it wrong, as with the 
names we discussed. When we compressed our message by removing 
the vowels, our compression was lossy. 
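

The invertibility test can be tried directly. The sketch below uses
the standard International Morse code for the letters A to Z. The
helper names, and the convention of separating letters with spaces and
words with ' / ', are our own choices.

```python
MORSE = {
    'A': '.-', 'B': '-...', 'C': '-.-.', 'D': '-..', 'E': '.', 'F': '..-.',
    'G': '--.', 'H': '....', 'I': '..', 'J': '.---', 'K': '-.-', 'L': '.-..',
    'M': '--', 'N': '-.', 'O': '---', 'P': '.--.', 'Q': '--.-', 'R': '.-.',
    'S': '...', 'T': '-', 'U': '..-', 'V': '...-', 'W': '.--', 'X': '-..-',
    'Y': '-.--', 'Z': '--..',
}
FROM_MORSE = {code: letter for letter, code in MORSE.items()}

def to_morse(message):
    """Encode letters and spaces: letters split by ' ', words by ' / '."""
    return ' / '.join(' '.join(MORSE[ch] for ch in word)
                      for word in message.split(' '))

def from_morse(signal):
    """Run the encoding backwards to recover the letters."""
    return ' '.join(''.join(FROM_MORSE[code] for code in word.split(' '))
                    for word in signal.split(' / '))

def drop_vowels(message, vowels='AEIOUY'):
    return ''.join(ch for ch in message if ch not in vowels)

message = 'SARA HAPPY BIRTHDAY BEST WISHES FROM DIANA'

# Lossless: encoding and then decoding gives back the original.
round_trip = from_morse(to_morse(message))

# Lossy: two different names produce the same output, so no decoder
# can tell which one was sent.
collision = drop_vowels('DIANA') == drop_vowels('DAN')
```

A transformation can only be run backwards if different inputs always
produce different outputs. The Morse table has that property, while
the vowel-dropping step does not.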


25.2.2 Domain Encoding 


Both MP3 and JPG are lossy systems for compressing data. In fact, 
they’re very lossy, often throwing away 90% or more of the original 
information. But both of those algorithms were very carefully designed 
to throw away just the “right” 90%, so that in many cases, we can’t tell 
the compressed version from the original (or maybe we can tell, but 
they’re still pretty close). 


This was achieved by carefully studying the properties of each kind of 
data, and how it was perceived. The MP3 data is based not just on the 
properties of sound, but on the properties of music and of the human 
auditory system. In the same way, the JPG algorithm is not only spe- 
cialized towards the structure of data within images, but it also includes 
information about the human visual system. 


In an impossible world, compressed files would be tiny, and their 
decompressed versions would match their corresponding originals 
perfectly. In the real world, we trade off the fidelity, or accuracy, of 
the decompressed image (how well it matches our original) with the 
file size. Generally speaking, the bigger the file, the better the match of 
the decompressed file to the original. This makes perfect sense when 
we observe that the file’s size corresponds to how much information it 
holds. Smaller files can hold less total information than larger files. 


The designers of lossy compression algorithms work hard to selectively 
lose just the information that matters to us the least for that 
particular type of file. Often this question of “what matters” to a 
person is an issue of debate, leading to a variety of different lossy 
encoders (such as MP3 and AAC for audio, and JPG for images). 


Just for fun, let’s compress an image with an MP3 encoder and see 
what happens. To keep things simple, we'll use a grayscale image. We 
turned a picture of a tiger into a sound file by writing out one row after 
another from top to bottom. Then we compressed the sound file with 
MP3, and then ran the process backwards to recover an image. Figure
25.1 shows a grayscale tiger before and after extreme MP3 compression
(a constant 8k bits per second). The MP3 encoder does a surprisingly
good job. We also show the image after similarly extreme JPG compression
(with quality set to 0).


Figure 25.1: Compressing an image with an MP3 encoder. Upper-left: The
original grayscale tiger at 512 by 512 resolution. Upper-middle: The tiger
after extreme MP3 compression at 8k bits per second. Upper-right:
The tiger after extreme JPG compression. Upper horizontal strip: The
audio file corresponding to the tiger before MP3 compression. Bottom
horizontal strip: A close-up of the start of the original audio file, showing
about the first 17 rows starting from the top.




The sharp dips in the close-up audio at the bottom of Figure 25.1 correspond
to the black tip of the tiger’s ear. When played, the audio sounds
like the propellers of a helicopter hovering overhead. The original,
MP3, and JPG images respectively contain about 262,000, 37,000, and
26,000 bytes, so the MP3 file was compressed by about 85%, and the
JPG by about 90%. Remarkably, both compressed versions look pretty
good, even in details like the whiskers, though the MP3 version seems
to have lightened up the image. This demonstrates how well-designed
the JPG and MP3 algorithms are, even when pushed hard on data they
weren't intended to see.
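
The row-by-row conversion between image and sound can be sketched in a few lines of Python. This is only an illustration under some assumptions: the function names are made up for this example, a tiny 3 by 4 grid of numbers stands in for the 512 by 512 tiger, and the MP3 encoding step itself is left out. The byte counts are the approximate ones quoted above.

```python
def image_to_signal(image):
    """Flatten a 2D grayscale image into one long sequence, row after row."""
    return [pixel for row in image for pixel in row]


def signal_to_image(signal, width):
    """Cut the sequence back into rows of the given width."""
    return [signal[i:i + width] for i in range(0, len(signal), width)]


def percent_saved(original_bytes, compressed_bytes):
    """Percentage of the original size removed by compression."""
    return 100.0 * (1.0 - compressed_bytes / original_bytes)


# A tiny stand-in image: 3 rows of 4 pixel values each.
image = [[0, 10, 20, 30],
         [40, 50, 60, 70],
         [80, 90, 100, 110]]

signal = image_to_signal(image)              # what we would hand to an audio encoder
restored = signal_to_image(signal, width=4)  # running the process backwards

print(restored == image)                     # with no lossy step, the round trip is exact
print(percent_saved(262_000, 37_000))        # savings for the MP3 version
print(percent_saved(262_000, 26_000))        # savings for the JPG version
```

Because each row is simply appended to the previous one, the audio encoder has no idea that pixels in neighboring rows sit next to each other in the image.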


Figure 25.2 shows close-ups of the tiger’s eye in each of the three 
images. 





Figure 25.2: Close-ups of the image of Figure 25.1. Left: The original tiger.
Middle: The MP3 compressed version. Right: The JPG image.


Note the streaking in the middle image in Figure 25.2, showing the 
nature of MP3’s assumption that it’s working with time-ordered sound 
data (the data was saved as sound one row at a time, so each row rep- 
resents a brief piece of sound, followed by the sound on the next row). 
The far right image shows the blocky structure at the heart of JPG’s 
assumption that it’s working with image data. 


The MP3 file looks noisy and choppy, compared to the original and 
the JPG versions. This isn’t surprising, since MP3 wasn’t intended to 
encode images. It’s impressive that it looks as good as it does, because 
it was throwing away information believing that it represented a lin- 
ear sequence of sound amplitudes intended for human ears, not a 2D 
arrangement of pixels intended for human eyes. 


25.2.3 Blending Representations 


Later in this chapter we’re going to find numerical representations of 
our inputs, and we'll blend those to create new data that is like the 
input data, but different. 


There are two general approaches to blending data.


The first approach can be described as content blending. That’s
where we blend the content of two pieces of data with each other. For
example, if we place a cow and zebra over one another, we get something
like Figure 25.3.





Figure 25.3: Content blending a cow and zebra. Blending each by 50% 
gives us a superposition of the two images, not a half-cow, half-zebra 
animal. 


The result is a combination of the two images, not an in-between ani- 
mal that is half cow and half zebra. 


To get that, we would use a second approach, called parametric 
blending, or representation blending. Here we work with param- 
eters that describe the thing we're interested in. By blending two sets 
of parameters, depending on the nature of the parameters and the 
algorithm we use to create the object, we can create results that blend 
the inherent qualities of the things themselves. 


For example, suppose we have two circles, each described by a center, 
radius, and color, as in Figure 25.4.


Figure 25.4: Two circles we’d like to blend.


If we use content blending on the two images, we get each circle at 
half-intensity, as in Figure 25.5. 








Figure 25.5: Content blending the two circles means using 50% of the 
image of each. 





But if we blend the parameters (that is, we blend the two values
representing the X component of the circle with each other, and the two
values for Y, and similarly for radius and color) then we get in-between
circles, as in Figure 25.6.


Figure 25.6: Parametric blending of the two circles means blending their
parameters (center, radius, and color). Here we show intermediate blends
at 25%, 50%, and 75% between the two starting circles.
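
Parametric blending of the circles can be written down directly. The sketch below is one way to do it, assuming each circle is stored as a center, a radius, and an RGB color. The Circle class, the function names, and the particular numbers are invented for illustration and are not the values behind the figures.

```python
from dataclasses import dataclass


@dataclass
class Circle:
    x: float
    y: float
    radius: float
    color: tuple  # (red, green, blue), each from 0 to 1


def lerp(a, b, t):
    """Linear blend: t=0 gives a, t=1 gives b."""
    return a + t * (b - a)


def blend_circles(c0, c1, t):
    """Blend every parameter of two circles by the same amount t."""
    return Circle(
        x=lerp(c0.x, c1.x, t),
        y=lerp(c0.y, c1.y, t),
        radius=lerp(c0.radius, c1.radius, t),
        color=tuple(lerp(p, q, t) for p, q in zip(c0.color, c1.color)),
    )


# Two made-up circles (illustrative values only).
small_red = Circle(x=1.0, y=1.0, radius=1.0, color=(1.0, 0.0, 0.0))
big_blue = Circle(x=5.0, y=3.0, radius=2.0, color=(0.0, 0.0, 1.0))

for t in (0.25, 0.5, 0.75):
    print(t, blend_circles(small_red, big_blue, t))
```

Each blended result is itself a complete circle that we can draw, which is what makes this different from overlaying two pictures.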








This works well for uncompressed objects. But if we try this with com- 
pressed objects, we rarely get reasonable in-between results. The 
problem is that the compressed form may have little in common with 
the internal structure that we need to meaningfully blend the objects. 


For example, let’s take the sounds of the words “cherry” and “orange.” 
These are our objects. We can blend these sounds together by having 
two people say the words at the same time, creating the audio version 
of our cow and zebra in Figure 25.3. 


We can think of turning these sounds into written language as a form 
of compression. For example, if it takes a half-second to say the word 
“cherry”, then using MP3 at a popular compression setting of 128 Kbps, 
we'd need about 8,000 bytes [AudioMountain16]. Using the simple but 
popular format of one byte per letter, the written form would require 
only 6 bytes, which is vastly smaller. 
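
The size comparison is simple arithmetic, sketched here in Python. The helper name is made up, and we assume that 1 Kbps means 1,000 bits per second and that file headers can be ignored.

```python
def mp3_bytes(seconds, kilobits_per_second):
    """Approximate size of constant-bitrate audio, ignoring file headers.

    Assumes 1 kilobit = 1,000 bits.
    """
    bits = seconds * kilobits_per_second * 1000
    return bits / 8


audio_size = mp3_bytes(seconds=0.5, kilobits_per_second=128)
text_size = len("cherry".encode("ascii"))   # one byte per letter

print(audio_size)               # bytes for the spoken word
print(text_size)                # bytes for the written word
print(audio_size / text_size)   # how many times smaller the text is
```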


Since the letters are drawn from the alphabet, which has a given order- 
ing, we can blend the representations by blending the letters through 
the alphabet. This isn’t going to work for letters, but let’s follow the 
process through because a version of this will work for us later.


The first letters of “cherry” and “orange” are C and O. In the alphabet,
the region spanned by these letters is CDEFGHIJKLMNO. Right
in the middle is the letter I, so that’s the first letter of our blend. When
the first letter appears later in the alphabet than the second, as in E to
A, we count backwards. When there’s an even number of letters in the
span, we choose the first of the two middle letters that we reach. As
shown in Figure 25.7, this blending gives us the sequence IMCPMO.


C   DEFGHIJKLMN           O
H   IJKLMNOPQ             R
E   DCB                   A
R   QPO                   N
R   QPONMLKJIH            G
Y   XWVUTSRQPONMLKJIHGF   E


Figure 25.7: Blending the written words “cherry” and “orange” by finding
the midpoint of each pair of letters in the alphabet. In the case of R and G
we choose the first of the two middle letters that we reach, which is M.


What we wanted was something that, when uncompressed, sounded
like a blend between the sound of “cherry” and the sound of “orange.”
Saying the string “imcpmo” out loud definitely does not satisfy that
goal. Beyond that, it’s a meaningless string of letters that doesn’t
correspond to any fruit, or even any word at all in English.
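
The letter-by-letter midpoint rule is mechanical enough to write as a short Python sketch. The function names are made up for this example, and we assume both words are the same length and use only uppercase letters.

```python
def blend_letters(a, b):
    """Return the letter midway between a and b, walking from a toward b."""
    start, end = ord(a), ord(b)
    step = 1 if end >= start else -1
    span = [chr(code) for code in range(start, end + step, step)]
    # With an even-length span there are two middle letters;
    # integer division picks the one we reach first.
    return span[(len(span) - 1) // 2]


def blend_words(word_a, word_b):
    """Blend two equal-length words letter by letter."""
    return "".join(blend_letters(a, b) for a, b in zip(word_a, word_b))


print(blend_words("CHERRY", "ORANGE"))
```

The procedure runs without complaint, which is the point: blending is easy to carry out on this representation, but the result has nothing to do with how either word sounds.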


In this case, blending the compressed representations doesn’t give us 
anything like the blended objects. 


We'll see that a remarkable feature of autoencoders, and in particular 
the variational autoencoder we'll see at the end of the chapter, is that 
they do allow us to blend the compressed versions, and (to a point) 
recover blended versions of the original data.


25.3 The Simplest Autoencoder


We can build a deep learning system to figure out a compression 
scheme for any data we want. The key idea is to create a place in the
network where all of the data has to be represented by fewer numbers
than there are in the input. That, after all, is what compression is all 
about. 


For instance, let’s suppose that our input consists of 100 by 100 gray- 
scale images of different animals, and we'd like to build a system to 
compress them. Each image has 100x100=10,000 pixels, so our input 
layer will hold 10,000 numbers. Let’s arbitrarily say we’d like to find 
the best way to represent those images using only 20 numbers. 


One way to do this is to build a network as in Figure 25.8. It’s just one 
layer! 


Figure 25.8: Our first encoding stage is a single dense, or fully-connected, 
layer that turns 10,000 inputs into 20 numbers. 




Our input is 10,000 elements, going into a fully-connected layer of 
only 20 neurons. The output of those neurons for any given input is 
our compressed version of that image. In other words, with just one 
layer, we’ve built an encoder. 


The real trick now would be to be able to recover the original 10,000 
pixel values, or even anything close to them, starting from just these 
20 numbers.


To do that we immediately follow the encoder with a decoder, as in
Figure 25.9. In this case, we'll just make a fully-connected layer with
10,000 neurons, one for each output pixel.


Figure 25.9: An encoder (in blue) turns our 10,000 inputs into 20 vari- 
ables, then a decoder (in beige) turns those back into 10,000 values. 


Because the amount of data is 10,000 elements at the start, 20 in the 
middle, and 10,000 again at the end, we say that we’ve created a bot- 
tleneck, because of the similarity of the encoder stage to a bottle with 
a smaller neck than base (the decoder has the same shape, but in the 
other direction). Figure 25.10 shows the idea. 







Figure 25.10: We say the middle of a network like Figure 25.9 is a “bottle- 
neck” because it’s shaped like a bottle with a narrow top. 


Now we can train our system. Each input image is also the output target.
So the system tries to find the best way to crunch the input into just
20 numbers that can be un-crunched to best match the target, which is
the input itself.


This is an autoencoder. It gets that name because it automatically 
finds the best way to encode the input so that the decoded version is as 
close as possible to the input.


The compressed representation at the bottleneck is called the code, or
the latent variables (the word “latent” comes from the Latin word
lateo, meaning to “lie hidden” [Wikipedia16b]).
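
To make the idea of training with the input as its own target concrete, here is a minimal NumPy sketch of a two-layer autoencoder. It is not the network used for the figures, and it rests on several assumptions we chose for brevity: 64 inputs and 4 latent variables stand in for 10,000 and 20, the "images" are random numbers, the layers have no activation function, and training is plain gradient descent on the mean squared error.

```python
import numpy as np

rng = np.random.default_rng(0)

n_inputs, n_latent = 64, 4            # small stand-ins for 10,000 and 20
images = rng.random((50, n_inputs))   # 50 made-up "images" with pixels in [0, 1)

# The encoder and decoder are each a single fully-connected layer.
W_enc = rng.normal(0.0, 0.1, (n_inputs, n_latent))
b_enc = np.zeros(n_latent)
W_dec = rng.normal(0.0, 0.1, (n_latent, n_inputs))
b_dec = np.zeros(n_inputs)


def encode(x):
    return x @ W_enc + b_enc          # the latent variables


def decode(z):
    return z @ W_dec + b_dec          # the reconstruction


def loss(x):
    """Mean squared error between the reconstruction and the input itself."""
    return np.mean((decode(encode(x)) - x) ** 2)


initial_loss = loss(images)
learning_rate = 0.1
for step in range(2000):
    z = encode(images)
    out = decode(z)
    # Gradient of the mean squared error with respect to the output.
    d_out = 2.0 * (out - images) / images.size
    # Backpropagate through the decoder, then through the encoder.
    grad_W_dec = z.T @ d_out
    grad_b_dec = d_out.sum(axis=0)
    d_z = d_out @ W_dec.T
    grad_W_enc = images.T @ d_z
    grad_b_enc = d_z.sum(axis=0)
    W_dec -= learning_rate * grad_W_dec
    b_dec -= learning_rate * grad_b_dec
    W_enc -= learning_rate * grad_W_enc
    b_enc -= learning_rate * grad_b_enc

final_loss = loss(images)
print(initial_loss, final_loss)       # the loss after training should be lower
```

Notice that the target passed to the loss is the same array as the input. No labels appear anywhere.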


Usually we make the bottleneck using a small layer in a deep network. 
Naturally enough, this layer is often called the latent layer or the 
bottleneck layer. The outputs of the neurons on this layer are called 
the latent variables. The idea is that these “variables” represent the 
image in some way. 


Notice something odd about this design: there are no category labels 
(as with a categorizer) or targets (as with a regression model). We 
don’t have any other information for the system other than the input 
we want it to compress and then decompress. 


We say that an autoencoder is an example of semi-supervised
learning (it is also often described as self-supervised or unsupervised
learning). It sort-of is supervised learning because we give the system
explicit goal data (the output should be the same as the input), and it
sort-of isn’t supervised learning because we don’t have any manually
determined labels or targets on the inputs.


Figure 25.9 is the simplest version of an autoencoder. Let’s train it on 
our tiger and see how it does. We'll feed it the tiger image over and 
over for a long time, using the loss function to encourage it to output 
a full-size image of the tiger, despite the compression down to just 20 
numbers at the bottleneck. 


We'll train it until it seems to not be improving any more. Figure 25.11 
shows the result. Each error value shown on the far right is the origi- 
nal pixel value minus the corresponding output pixel value (the pixels 
were scaled to the range [0,1] as usual).


Figure 25.11: Training our autoencoder of Figure 25.9 on just the tiger.
Left: The input tiger. Middle: The output. Right: The pixel by pixel
differences between the original and output tiger. The autoencoder seems to
have done an amazing job, since the bottleneck had only 20 numbers.


This is fantastic! Our system took a picture composed of 10,000 pixel 
values, crunched them down to 20 numbers, and then practically 
recovered the entire picture again, right down to the thin wispy whis- 
kers. The pixels range from 0 to 1, and the biggest error in any pixel 
was about 1 part in 100. It looks like we’ve found a fantastic way to do 
compression! 


Wait a second. That doesn’t make sense. There’s just no way to rebuild
that tiger image from 20 numbers without doing something sneaky. In
this case, the sneaky thing is that the network has memorized the image.
It simply set up all 10,000 output neurons to take those 20 input numbers
and reconstruct the original 10,000 input values. In other words,
the whole network can do only one thing: output this specific tiger
image. We didn’t really compress anything at all. Each of the 10,000
inputs went to each of the 20 neurons in the bottleneck layer, requiring
20 x 10,000 = 200,000 weights, and then the 20 bottleneck results
all went to each of the 10,000 neurons in the output layer, requiring
another 200,000 weights, whose output was the picture of the tiger.
We basically found a way to store 10,000 numbers using only 400,000
numbers. Um, hooray?
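
The weight count is easy to check. The helper below is a made-up name for this sketch. It also counts the bias of each neuron, which the total of 400,000 leaves out.

```python
def dense_parameters(n_in, n_out):
    """Weights and biases in one fully-connected layer."""
    weights = n_in * n_out
    biases = n_out
    return weights, biases


enc_w, enc_b = dense_parameters(10_000, 20)   # encoder: 10,000 inputs to 20 neurons
dec_w, dec_b = dense_parameters(20, 10_000)   # decoder: 20 inputs to 10,000 neurons

print(enc_w + dec_w)                    # weights only
print(enc_w + enc_b + dec_w + dec_b)    # weights plus biases
```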


In fact, most of those numbers are irrelevant. Remember that each
neuron has a bias that’s added alongside the incoming weights. The
output neurons are relying mostly on their bias values, and not too much
on the inputs. To test this, Figure 25.12 shows the result of giving the
autoencoder a picture of a flight of stairs. It doesn’t do a poor job of
compressing and decompressing the stairs. Instead, it mostly ignores
the stairs, and gives us back the memorized tiger. The output isn’t
exactly the input tiger, as shown by the rightmost image, but just looking
at the output it’s hard to see what the stairs might be contributing
to the output.





Figure 25.12: Left: Presenting our tiny autoencoder trained on just the 
tiger with an image of a stairway. Middle: The output is the tiger! Right: 
The difference between the output image and the original tiger. 


The error bar on the right of Figure 25.12 shows that our errors are 
much larger than those of Figure 25.11, but the tiger still looks a lot 
like the original. 


Let’s make a real stress test of the idea that the network is mostly relying
on the bias values. We’ll feed the autoencoder an input image that
is 0 everywhere. Then there are no input values to work with, and only
the bias values contribute to the output. Figure 25.13 shows the result.


Figure 25.13: When we give our tiny autoencoder a field of pure black,
it uses the bias values to give us back a low quality, but recognizable,
version of the tiger. Left: The black input. Middle: The output. Right: The
difference between the output and the original tiger. Note that the range
of differences runs from 0 to almost 1, unlike Figure 25.12 where they ran
from about -0.4 to 0.


No matter what input we give to our network, we'll get back some 
version of the tiger as output. The autoencoder has trained itself to 
produce the tiger every time. 
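
We can check the all-black case numerically with a small NumPy sketch. The sizes are tiny stand-ins, the values are random rather than trained, and the ReLU activation in the bottleneck is our assumption. When the input is all zeros, every encoder weight is multiplied by 0, so the latent values come only from the encoder's biases, and the output is whatever the decoder makes of them.

```python
import numpy as np

rng = np.random.default_rng(1)
n_inputs, n_latent = 16, 3   # small stand-ins for 10,000 and 20


def relu(v):
    return np.maximum(v, 0.0)


def autoencode(x, W_enc, b_enc, W_dec, b_dec):
    latent = relu(x @ W_enc + b_enc)
    return latent @ W_dec + b_dec


b_enc = rng.normal(size=n_latent)
W_dec = rng.normal(size=(n_latent, n_inputs))
b_dec = rng.normal(size=n_inputs)

black = np.zeros(n_inputs)

# Two completely different sets of encoder weights...
out_a = autoencode(black, rng.normal(size=(n_inputs, n_latent)), b_enc, W_dec, b_dec)
out_b = autoencode(black, rng.normal(size=(n_inputs, n_latent)), b_enc, W_dec, b_dec)

# ...give the same output, because 0 times any weight is 0.
print(np.allclose(out_a, out_b))
```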


A real test of the autoencoder would be to teach it a bunch of images 
and see how well it compresses them. We used a set of 25 photographs, 
shown in Figure 25.14.


Figure 25.14: The 25 photographs that we used, in addition to the tiger,
to train our tiny autoencoder. Each image was rotated by 90°, 180°, and
270° during training, so this set was effectively 100 images.


We made the database larger by training not just on each image, but
also on each image rotated by 90°, 180°, and 270°. So our training set
was the tiger (and its three rotations) and the 25 images of Figure
25.14 with their rotations (100 images), for a total of 104 images.
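
Enlarging a training set with rotations takes only a few lines. This sketch uses plain Python lists and made-up function names, with a 2 by 2 grid standing in for a photograph.

```python
def rotate_90(image):
    """Rotate a 2D list of pixels 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def with_rotations(image):
    """Return the image plus its 90, 180, and 270 degree rotations."""
    versions = [image]
    for _ in range(3):
        versions.append(rotate_90(versions[-1]))
    return versions


tiny = [[1, 2],
        [3, 4]]
for version in with_rotations(tiny):
    print(version)

n_images = 25 + 1        # the photographs plus the tiger
print(n_images * 4)      # training images once the rotations are included
```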


Now that the system is trying to remember how to represent all 104 
of these pictures with just 20 numbers, it should be no surprise that 
it can’t do a very good job. Figure 25.15 shows what this autoencoder 
produces when we ask it to compress and decompress the tiger.


Figure 25.15: We trained our autoencoder of Figure 25.9 with the 100
images of Figure 25.14 (each image plus its rotated versions), along with
the four rotations of the tiger. Using this training, we gave it the tiger on
the left, and it produced the output in the middle.


Now that the system isn’t allowed to cheat, the result doesn’t look like 
a tiger at all. We can see a little bit of 4-way rotational symmetry in the 
result owing to our training on the rotated versions of the input images. 


Figure 25.15 makes a lot more sense. 


We could do better by increasing the number of neurons in the bot- 
tleneck, or latent, layer. But since we want to compress our inputs as 
much as possible, adding more values to the bottleneck should be a 
last resort. We’d rather do the best possible job we can with as few val- 
ues as we can get away with. 


Let’s try to improve the performance by considering a more complex 
architecture than just the two dense layers we’ve been using so far. 


25.4 A Better Autoencoder 


In this section, we'll explore a variety of autoencoder architectures. To 
compare them we'll use the MNIST database we saw in Chapter 21. To 
recap, this is a big, free database of hand-drawn, grayscale digits from 
0 to 9. Figure 25.16 shows some typical digit images from the MNIST
dataset.


Figure 25.16: A sampling of the hand-written digits from the MNIST
dataset.


To run our simple autoencoder on this data we'll need to change the 
size of the inputs and outputs of Figure 25.9 to fit the MNIST data. 
The input and output layers will now have 784 elements instead of 
10,000, since 28 x 28 = 784. We'll leave the bottleneck at 20. Figure 
25.17 shows the structure. 


Figure 25.17: Our 2-layer autoencoder for MNIST data. The encoder is
the first hidden layer with 20 neurons, and the decoder is the output
layer with 784 neurons.


We'll train this for 50 epochs (that is, we’ll run through all 60,000
training examples 50 times). We'll hope that, compared to the small
training set of Figure 25.14, this big training set of simple images will
give us better performance.


Some results are shown in Figure 25.18.


Figure 25.18: Running 5 digits from the MNIST dataset through our
trained autoencoder of Figure 25.17, which uses 20 latent variables. Top
row: The input data. Bottom row: The reconstructed images.


Figure 25.18 is pretty amazing. Our two-layer network learned how to 
take each input of 784 pixels, squash it down to just 20 numbers, and 
then blow it back up to 784 pixels. The resulting digits are blurry, but 
recognizable. 


Let’s try reducing the number of latent variables down to 10. We expect
things are going to look a lot worse. Figure 25.19 shows that they are
indeed worse.


Figure 25.19: Top row: the original MNIST images. Bottom row: the output
of our autoencoder using 10 latent variables.



This is getting pretty bad. The 2 seems to be turning into a 3 with a 
bite taken out of it, and the 4 seems to be turning into a 9. But that’s 
what we get for crushing these images down to 10 numbers. That’s just 
not enough to help the system completely capture the input. 


Let’s make things ridiculous and lower the number of latent variables 
down to 3. Figure 25.20 shows the results.


Figure 25.20: Top row: the original MNIST images. Bottom row: the
output of our autoencoder using 3 latent variables.


And that’s pretty dreadful. Remarkably, even though we're certainly 
not recreating the inputs faithfully, we’re still getting back things that 
look like blurry, mashed-up digits. 


The lesson is that our autoencoder needs to have both enough compu- 
tational power (that is, enough weights) to figure out how to encode 
the data, and enough latent variables (that is, intermediate values) to 
find a useful compressed representation of the input. 


Let’s see how deeper models will perform. The encoder and decoder 
can be built with any deep learning strategy we like: multi-layer per- 
ceptrons for general data, convolution layers for images, and even 
recurrent neural networks for sequential data. We can make deep auto- 
encoders with lots of layers, or shallow ones with only a few, depending 
on our data. 



For now, let’s continue using dense layers, but we'll add some more 
of them to create a deep autoencoder. We'll make the encoder stage 
from several hidden layers of decreasing size until we reach the bottle- 
neck, and then we'll build a decoder from several more hidden layers 
of increasing size until they reach the same size as the input. 


Figure 25.21 shows this approach where now we have three layers of 
encoding and three of decoding.


Figure 25.21: A deep autoencoder built out of fully-connected (or dense)
layers. Blue icons: A 3-layer encoder. Beige icons: A 3-layer decoder.


We often build these fully-connected layers so that they decrease (and
then increase) by a factor of two, as when we go between 512 and
256. That choice seems to work out well, but there’s no rule enforcing
it.
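
To get a feeling for how much more capacity the deep version has, we can count its weights and biases. The layer sizes below are an assumption: the text mentions only 512 and 256, so we guess an encoder of 512, 256, and 20 neurons mirrored by a decoder of 256, 512, and 784. The function name is made up for this sketch.

```python
# Assumed layer sizes (only 512 and 256 are mentioned in the text).
layer_sizes = [784, 512, 256, 20, 256, 512, 784]


def count_parameters(sizes):
    """Total weights and biases in a stack of fully-connected layers."""
    total = 0
    for n_in, n_out in zip(sizes, sizes[1:]):
        total += n_in * n_out + n_out
    return total


deep = count_parameters(layer_sizes)
shallow = count_parameters([784, 20, 784])   # the 2-layer autoencoder
print(deep, shallow)
```

Both networks squeeze the image through the same 20 latent variables. Under these assumed sizes, the deep one simply has far more numbers to work with on either side of the bottleneck.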

We'll train this autoencoder just like all the others, for 50 epochs. 
Figure 25.22 shows the results.


Figure 25.22: Predictions from our deep autoencoder of Figure 25.21.
Top row: Images from the MNIST test set. Bottom row: Output from our
trained autoencoder when presented with the test digits.


The results are just a little blurry, but they match the originals
unambiguously. Compare these results to Figure 25.18, which also used 20
latent variables. These images are much clearer. So by providing additional
compute power to find those variables (in the encoder), and
extra power in turning them back into images (in the decoder), we’ve
gotten much better results out of our 20 latent variables.


25.5 Exploring the Autoencoder 


Let’s get to know autoencoders better by looking more closely at the 
results from our latest network in Figure 25.21. 


25.5.1 A Closer Look at the Latent Variables 


We've seen that the latent variables are the compressed form of the 
inputs, but we haven’t looked at the latent variables themselves. Figure 
25.23 shows graphs of the 20 latent variables, and the images that the 
decoder constructs from them.


Figure 25.23: Top row: Five images from the MNIST database. Middle
row: The 20 latent variables for each image. Bottom row: The digits
decompressed from the latent variables in the middle row.


How sensitive is the system to these exact values? Let’s try adding a
little bit of noise to them. It would be great if tweaking the values just
a little also tweaked the shape, perhaps putting a loop into the bottom-left
of the 2, or closing the top of the 4.


Let’s try it and see. 


Looking at Figure 25.23 we can see that the latent values are in the
range 0 to 600, so let’s add a random number from -1 to 1 to each
latent value, and then use the decoder to create the image corresponding
to these shifted values. The results are in Figure 25.24.


Figure 25.24: The results of adding noise between -1 and 1 to each of
the latent variables in Figure 25.23 for each digit. It’s hard to see any
difference.





There’s no visible difference between these images and those in Figure
25.22, so it seems that adding this little bit of noise hasn’t had much
effect. Let’s crank up the noise to values from -10 to 10. The results
are in Figure 25.25.


Figure 25.25: We've added a random value from -10 to 10 to every latent
variable in Figure 25.23. These look worse than before.


The digits are definitely showing some wear and tear. The edges are
more jagged and rough, and the 1 is eroding quite a bit, but mostly the
digits are still recognizable.


Let’s try adding values from -100 to 100 to the latent variables. Figure
25.26 shows the result.


Figure 25.26: Adding noise from -100 to 100 to each latent variable.


Things are looking pretty terrible. It’s hard to even guess that these
were digits. Adding values from -100 to 100 gives us bad results.
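
The noise experiments all follow the same recipe, sketched here with Python's standard random module. The latent values are made up (random numbers in the 0 to 600 range), and the decoder step is omitted, so this shows only how the latent values are disturbed before decoding.

```python
import random


def add_noise(latents, amount, rng):
    """Add an independent random offset in [-amount, amount] to each value."""
    return [z + rng.uniform(-amount, amount) for z in latents]


rng = random.Random(0)
# Made-up latent values in the 0 to 600 range.
latents = [rng.uniform(0, 600) for _ in range(20)]

for amount in (1, 10, 100):
    noisy = add_noise(latents, amount, rng)
    biggest_shift = max(abs(n - z) for n, z in zip(noisy, latents))
    print(amount, biggest_shift)   # the shift never exceeds the noise amount
```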


But maybe if we change just one of the latent variables, then we'll see
that there’s a meaningful shift. Let’s add a random value from -100 to
100 to only the first latent value. The results are in Figure 25.27.


Figure 25.27: Adding a random value from -100 to 100 to just the first
latent variable for each digit in Figure 25.23.


The 7 and 0 didn’t depend on the first latent value too much, so they
came through the process pretty well. The other digits that did depend
on that value got pretty corrupted.


Let’s try using latent variables that are all completely random. In 
Figure 25.28 each latent value is a random number between 0 and 25.


Figure 25.28: Applying our decompression step to entirely random
sequences of latent variables.





These white globs are definitely not digits. So in a practical sense, for 
this autoencoder, random latent values seem to produce meaningless, 
random output images. 


25.5.2 The Parameter Space 


We've seen that just fiddling with the latent variables doesn’t seem
to give us any kind of predictable control over the output. Changing
the latent values gave us digits that looked worse and worse.


But maybe there’s some other structure in the latent variables that we 
can exploit. Let’s look for that structure. 


To look at the latent variables, we'll use the deep autoencoder of Figure
25.21, but instead of making the last stage of the encoder a fully-connected
layer of 20 neurons, we'll drop that to merely 2 neurons, so
we have just 2 latent variables. We saw from Figure 25.20 that with 3
latent variables the matches were very fuzzy and not always close to
the inputs, and it’s going to be even worse with only 2 latent variables.
But we can plot 2 latent variables on the page, so let’s proceed anyway.
We know that the autoencoder is going to be a pretty bad one, in the
sense that the output images will look blurry, but maybe we can get
some insight into the structure of the latent variables even with just 2
of them.


In Figure 25.29 we encoded 10,000 MNIST images and found each
image’s 2 latent variables, and then plotted them all. Each dot is
color-coded for the label assigned to its image.


Figure 25.29: After training a deep autoencoder with only 2 latent variables,
we show the latent variables assigned to each of 10,000 MNIST
images.


There’s a lot of structure here! The latent variables aren’t getting
essentially arbitrary numbers. Instead, similar images are getting assigned
similar latent variables.


In fact, most of the digits are nicely clustered. The 0’s, 1’s and 3’s, for 
instance, look like they have their own regions. The 7’s and 9’s have 
a lot of overlap (which they share with 4’s), telling us that this model 
can’t really tell these three types of images apart, and is assigning them 
similar regions of latent variables. 


Let’s look at the images that are generated if we step through the ranges 
of latent variables and hand them to the decoder. Figure 25.30 shows 
the result.


Figure 25.30: Images generated from latent variables in the range of
Figure 25.29.


The other digits are also getting pretty well scrambled in the lower-left
of the plot, getting assigned the same sorts of values. Figure 25.31
shows a close-up view of that region.


Figure 25.31: A close-up of the lower-left corner of Figure 25.29.


So it’s not a total jumble. The 0’s have their own zone, while the 3’s
and 5’s are pretty well mixed, as are the 6’s and 2’s.


Let’s look at the images in this smaller range, shown in Figure 25.32.


Figure 25.32: Images from the close-up range of latent variables in Figure
25.31. The digits are so well mixed in this zone that most of them come
out as blended combinations.


With more latent variables we’d expect things to become more sep- 
arated and distinct, though we wouldn’t be able to draw pictures of 
those high-dimensional spaces. 


Nevertheless, this shows us that even with an extreme compression 
down to just 2 latent variables, the system assigned those values in 
ways that grouped similar digits together. 


Let’s look more closely at this structure. Figure 25.33 shows the images
encountered while moving along four lines through the plot.


Figure 25.33: The four arrows in the plot correspond to the images generated
by points along those lines below.


In row A we can see that the region for 1’s begins with vertical lines and 
ends with slanted lines. So it’s not just that all the 1’s are grouped, but 
even similar types of 1’s are grouped together. Row B shows a change 
from a recognizable 3 to digits that are looking more like 9’s. Row C 
goes through a short section in the bottom left, through the cloud of 
0’s, 2’s, and 6’s. Finally, row D starts near row C, but moves to the
right into the zone of 3’s. 


We can infer from this that the encoder assigned similar latent variables
to similar images, and seemed to build clusters of different images,
with each variation of the image in its own region. That’s a whole lot
of structure. We can expect that as we go from a ridiculously small 2
latent variables to more useful larger numbers of them, the encoder
will continue to produce clustered regions, but they will become more
distinct and there will be less overlapping.


25.5.3 Blending Latent Variables 


Now that we’ve seen the structure inherent in the latent variables, let’s 
use it. In particular, let’s blend some pairs of latent variables together, 
and see if we get an intermediate image. In other words, we'll do para- 
metric blending on the images, as we discussed earlier, where the 
latent variables are the parameters. 


We actually did this in Figure 25.33 as we blended the 2 latent variables
from one end of an arrow to the other. But we were using an autoen- 
coder with just two latent variables, so it wasn’t able to represent the 
images very well. The results were mostly recognizable because most 
of our examples blended from one digit to another of the same type, 
though row C showed some pretty weird shapes. Let’s use some more 
latent variables so we can get a feeling for what this kind of blending, 
or interpolation, looks like in more complex models. 


We'll return to our deep autoencoder with 20 latent variables. We'll 
pick out pairs of images, find the latent variables for each one, and then 
simply average each pair of latent variables. Then we'll hand those to 
the decoder and see what it produces. 


In other words, we'll take the first latent variable for each image and 
find their average. Then we'll average the second latent variable for 
each image, and so on, until we have 20 averages. 
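
For readers who like to see this step as code, here is a minimal sketch of the averaging in Python with NumPy. The function name average_latents is ours, and the two latent vectors are random stand-ins (an assumption for illustration); in practice they would come from the trained encoder.

```python
import numpy as np

def average_latents(z_a, z_b):
    """Element-by-element average of two latent vectors of the same length."""
    z_a = np.asarray(z_a, dtype=float)
    z_b = np.asarray(z_b, dtype=float)
    if z_a.shape != z_b.shape:
        raise ValueError("latent vectors must have the same shape")
    return (z_a + z_b) / 2.0

# Stand-in latent vectors (hypothetical values, not real encoder output).
rng = np.random.default_rng(0)
z_first = rng.uniform(0.0, 5.0, size=20)
z_second = rng.uniform(0.0, 5.0, size=20)

# 20 averages, one per latent variable, ready to hand to the decoder.
z_blend = average_latents(z_first, z_second)
```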


Figure 25.34 shows five pairs of images blended this way.


Figure 25.34: What happens when we blend latent variables in our deep
autoencoder. Top row: Five images from the MNIST dataset. Middle row: Five
more random images. Bottom row: The image that results from averaging the
latent variables of the two images directly above, and then decompressing.


As we expected, the system isn’t simply blending the images with con- 
tent blending (like we did for the cow and zebra in Figure 25.3). Instead, 
the autoencoder is trying to find intermediate images that have quali- 
ties of both inputs. 


These results aren’t absurd. Most of them look more or less like one of 
the endpoints, though in the second column the blend between a 2 and 
4 looks like a partial 8. That makes sense. Figure 25.30 shows us that 
the 2’s, 4’s and 8’s are close together in the diagram with only 2 latent 
variables, so it’s reasonable that they could still be near one another in 
a 20-dimensional diagram with 20 latent variables. 


Let’s look at this kind of blending of latent variables more closely.
Figure 25.35 shows three new pairs of digits with 6 equally-spaced
steps of interpolation.


Figure 25.35: Blending the latent variables for images by different
amounts. The far left and right of each row are a pair of images from the 
MNIST data. We found the 20 latent variables for each endpoint, created 
six equally-spaced blends of those latent variables, and then ran that 
result through the decoder. 
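
The equally-spaced blends can be sketched the same way. This is only an illustration under our own assumptions: we treat the six blends as lying strictly between the two endpoints, and the function name blend_latents and the endpoint values are made up for the example.

```python
import numpy as np

def blend_latents(z_start, z_end, steps=6):
    """Return `steps` equally-spaced blends strictly between two latent vectors."""
    z_start = np.asarray(z_start, dtype=float)
    z_end = np.asarray(z_end, dtype=float)
    # Blend weights between 0 and 1, leaving out the endpoints themselves.
    weights = np.linspace(0.0, 1.0, steps + 2)[1:-1]
    return np.stack([(1.0 - t) * z_start + t * z_end for t in weights])

# Hypothetical endpoints: every latent variable moves from 0 to 7.
blends = blend_latents(np.zeros(20), np.full(20, 7.0))
# Each row of `blends` is one set of 20 latent values for the decoder.
```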


The system is trying to move from one image to another, but it’s not 
producing very reasonable intermediate digits. Even when going from 
a 5 to a 5 in the middle row, the intermediate values almost break up
into two separate pieces before re-joining. Some of the blends near 
the middle of the top and bottom rows don’t look like any digits at all. 
Although the ends are recognizable, the blends fall apart very quickly. 


Blending latent parameters in this autoencoder smoothly changes the 
image from one digit to another, but the in-betweens are just weird 
shapes, rather than some kind of blended digits. We’ve seen that some- 
times this is due to moving through dense regions where similar latent 
variables encode different digits. 


But in some of these cases we’re moving through regions of the latent 
variables where there’s no data. In other words, we’re asking the 
decoder to reconstruct an image from values of latent variables that 
were never given meaning by the encoder. So it’s producing something, 
and that output has some qualities of the nearby regions, but the latent 
variables don’t encode for any kind of image, so the results are hard to 
predict.


25.5.4 Predicting from Novel Input


Just for fun, let’s try using this deep autoencoder trained on MNIST
data to compress and then decompress our tiger from Figure 25.1. 
We'll shrink the tiger to 28 by 28 pixels to match the network’s input 
size. 


The tiger is like nothing the network has ever seen before, so it’s com- 
pletely ill-equipped to deal with this data. It will try to “see” a digit in 
the image, and produce a corresponding output. Figure 25.36 shows 
the results. 





Figure 25.36: Compressing and then uncompressing a 28 by 28 version 
of our tiger of Figure 25.1 with our deep autoencoder of 20 latent vari- 
ables, trained on the MNIST handwritten digit dataset. This is a terribly 
unfair thing to ask of the network, and the results are as awful as we 
might expect. 


It looks like the algorithm has merged together an assortment of differ- 
ent pieces of different digits, trying to match the tiger. We can see the 
black border around most MNIST digits showing up, because there’s 
almost no data in that region for the program to use in order to fill in 
the edges. The splotch in the middle isn’t much of a match to the tiger, 
but remember that it shouldn’t be. 


Using information learned from digits to compress and decompress a
tiger is like trying to build a guitar using parts taken from pencil
sharpeners. Even doing our best, the result isn’t likely to be a good guitar.


An autoencoder can only meaningfully encode and decode the type
of data it’s been trained on, because it created meaning for the latent 
variables that allow it to represent that data. In other words, it had 
20 numbers to use as it pleased, and it used them to identify how to 
represent handwritten digits. When we surprise it with something 
completely different, it does its best, but it’s not going to be very good. 


25.6 Discussion 


We've seen something like an autoencoder before, when we talked 
about dimensionality reduction using Principal Component 
Analysis, or PCA, in Chapter 12. Recall that PCA found the dimen- 
sions along which the data was changing the most, and retained the 
data from those dimensions while discarding the data from the others. 
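
As a reminder of what that looks like in practice, here is a small sketch of PCA used as a compressor. It uses the singular value decomposition from NumPy, which is one common way to compute PCA and not necessarily the approach taken in Chapter 12. The function name pca_compress and the toy data are our own assumptions.

```python
import numpy as np

def pca_compress(data, n_keep):
    """Project rows of `data` onto the n_keep directions of greatest variance."""
    mean = data.mean(axis=0)
    centered = data - mean
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_keep]
    codes = centered @ components.T        # the values we retain
    restored = codes @ components + mean   # approximate reconstruction
    return codes, restored

rng = np.random.default_rng(1)
# Toy data: 200 samples that change a lot along one direction, and only
# a little (small added noise) along the others.
t = rng.normal(size=(200, 1))
data = t @ np.array([[3.0, 1.0, 0.0]]) + 0.05 * rng.normal(size=(200, 3))

codes, restored = pca_compress(data, n_keep=1)
reconstruction_error = np.mean((data - restored) ** 2)
```

Here each 3-number sample is squeezed into a single number, which plays the same role as a latent variable.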


To connect autoencoders and PCA, we can write the mathematics that 
describe an autoencoder like that in Figure 25.21, and those of PCA, 
and compare them. After some work, we can show that these two dif- 
ferent approaches ultimately produce the same results [Virie14]. A 
more complex autoencoder could do even more than PCA. 


We now have an intuitive picture for our autoencoder: it’s trying to 
combine the values in the inputs in the most productive way to pack 
the most information into each of the intermediate values. 


There are several variations on the basic autoencoder concept. Since
we’re working with images, and convolution is a natural approach for
that kind of data, let’s build an autoencoder using convolution layers.


25.7 Convolutional Autoencoders


We said earlier that our encoding and decoding stages could contain 
any kind of layers we wanted. Since our running example uses image 
data, let’s use convolutional layers. In other words, we'll build a con- 
volutional autoencoder. 


We'll build the encoder by scaling down the original 28 by 28 image
to 7 by 7. All of our convolutions will use 3 by 3 filters. Our final
convolution layer will have 3 of these filters, so this model will have
7 x 7 x 3 = 147 latent variables. As shown in Figure 25.37, we'll start
with convolution with 16 filters, and follow it by a maximum pooling
layer with a 2 by 2 cell, giving us a tensor that is 14 by 14 by 16. Then
we apply another convolution, this time with 8 filters, and follow that
with pooling, producing a tensor that’s 7 by 7 by 8. We have a final
encoder layer that uses three filters, so as we saw above, we have a
tensor that’s 7 by 7 by 3 at the bottleneck.


[Figure 25.37 layer labels, left to right: 16 x (3x3) ReLU; 2x2; 8 x (3x3) ReLU; 2x2; 3 x (3x3) ReLU; 2x2; 16 x (3x3) ReLU; 2x2; 1 x (3x3) sigmoid]








Figure 25.37: Architecture for our convolutional autoencoder. Left: In the 
encoding stage, we have three convolutional layers. The first two layers 
are each followed by a pooling layer, so by the end of the third convolu- 
tional layer we have an intermediate tensor of shape 7 by 7 by 3. Right: 
The decoder uses convolution and upsampling to grow the bottleneck 
tensor back into a 28 by 28 by 1 output. 


Our decoder runs the process in reverse. The first upsampling layer
produces a tensor that’s 14 by 14 by 3. The following convolution and
upsampling gives us a tensor that’s 28 by 28 by 16, and the final
convolution produces a tensor of shape 28 by 28 by 1.


Notice that the number of convolution filters starts out at 16 in the
encoder, then drops to 8 and then 3. So we immediately grow our input 
from 1 channel to 16, then whittle that down to 3 channels, then back 
up to 16, then back down to 1 for output. 
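
We can check this bookkeeping with a few lines of Python. This sketch only tracks tensor shapes; it does no actual convolution. It assumes that each convolution pads its input so that the width and height are unchanged, which is what the shapes above imply. The helper names conv, pool, and upsample are ours.

```python
def conv(shape, filters):
    """3 by 3 convolution with padding: only the channel count changes."""
    height, width, _ = shape
    return (height, width, filters)

def pool(shape, cell=2):
    """Pooling with a cell-by-cell window shrinks width and height."""
    height, width, channels = shape
    return (height // cell, width // cell, channels)

def upsample(shape, factor=2):
    """Upsampling grows width and height by the given factor."""
    height, width, channels = shape
    return (height * factor, width * factor, channels)

# Encoder
shape = (28, 28, 1)
shape = pool(conv(shape, 16))      # (14, 14, 16)
shape = pool(conv(shape, 8))       # (7, 7, 8)
bottleneck = conv(shape, 3)        # (7, 7, 3)
latent_count = bottleneck[0] * bottleneck[1] * bottleneck[2]   # 147

# Decoder
shape = upsample(bottleneck)       # (14, 14, 3)
shape = upsample(conv(shape, 16))  # (28, 28, 16)
output = conv(shape, 1)            # (28, 28, 1)
```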


Since we’ve got 147 latent variables, along with the power of the con- 
volutional layers, we should expect some very good results. We trained 
this convolutional autoencoder for 50 epochs, just as before. The 
model was still improving at that point, but we stopped at 50 epochs 
for the sake of comparison with the previous models. 


Figure 25.38 shows the first 5 examples from the test set and their 
decompressed versions after running through our convolutional 
autoencoder. 


Figure 25.38: Top row: The first 5 elements from the MNIST test set. 
Bottom row: The images produced by our convolutional autoencoder 
given each of the top row as input. 


These results are pretty great. The images aren’t identical, but they’re 
very close. 


Like before, let’s add a random value to each element of our
intermediate representation. We started as before by adding a value from
-1 to 1. Figure 25.39 shows the results.


Figure 25.39: Top row: The first 5 elements from the MNIST test set.
Bottom row: The images produced by our convolutional autoencoder
after adding a random value from -1 to 1 to each of the latent variables.
The results are hardly changed from before.


There’s some visible change, but the images are quite similar. Let’s
crank up the noise to the range -5 to 5 on each latent variable. The
results are in Figure 25.40.


Figure 25.40: Top row: The first 5 elements from the MNIST test set.
Bottom row: The images produced by our convolutional autoencoder
after adding a random value from -5 to 5 to each of the latent variables.
The results are a lot worse than before.








These digits have deteriorated. Most are recognizable, but the 4 at the
end seems to have pretty well dissolved.


Let’s crank up the noise and add a random value from the range -10 to
10 to each latent variable. The results are shown in Figure 25.41.


Figure 25.41: Top row: The first 5 elements from the MNIST test set.
Bottom row: The images produced by our convolutional autoencoder
after adding a random value from -10 to 10 to each of the latent variables.
These results are unlike the inputs.






These images look like random splotches of white paint dribbled on a 
black floor. We might be able to guess at the 0, but maybe not. 
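
The three noise experiments all follow the same recipe, sketched here in Python with NumPy. We assume the random values are drawn uniformly from the stated range (the text gives only the range), and the latent tensor is a random stand-in for the encoder's real 7 by 7 by 3 output. The function name perturb_latents is ours.

```python
import numpy as np

def perturb_latents(latents, limit, rng):
    """Add an independent uniform value from -limit to +limit to every element."""
    noise = rng.uniform(-limit, limit, size=latents.shape)
    return latents + noise

rng = np.random.default_rng(2)
# Stand-in for the encoder's output (hypothetical values).
latents = rng.uniform(0.0, 4.0, size=(7, 7, 3))

# One perturbed copy for each noise range used above.
noisy = {limit: perturb_latents(latents, limit, rng) for limit in (1.0, 5.0, 10.0)}
```

Each perturbed tensor would then go to the decoder in place of the original latent variables.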


For a final test, let’s try giving the decoder step nothing but noise, as 
we did before. Since our latent variables are a tensor of size 7 by 7 by 
3, our noise values will be a 3D volume of the same shape. So we'll just 
show the topmost 7 by 7 slice of the block that’s been filled up with 
random values. Figure 25.42 shows the results.


Figure 25.42: Each of these images was produced by handing an input
tensor of random values to the decoder stage of our convolutional neural 
network. The upper images show the topmost 7 by 7 slice of the input 
tensor that holds the latent variables. They are blown up here to the 
same size as the outputs. The results not only don't look like digits; they 
don't look like anything. 














As with our previous autoencoder, random latent values in produced 
random splotchy images out. 


25.7.1 Blending Latent Variables 


Let’s blend the latent variables in our convolutional autoencoder and 
see how it goes. 


In Figure 25.43 we show our grid using the same images as in Figure 
25.34. We find the latent variables for each image in the top two rows, 
blend them equally, and then decode the interpolated variables to
create the bottom row.


Figure 25.43: Blending latent variables in the convolutional autoencoder.
Top two rows: Samples from the MNIST dataset. Bottom row: The result 
of an equal blend of the latent variables for each of the above images. 


The results are pretty gloppy, though some have a feeling of being a 
mix of the images above. 


Let’s look at multiple steps along the way in the same three blends that
we used before in Figure 25.35. The results are shown in Figure 25.44.


Figure 25.44: Blending the latent variables of two MNIST test images and
then decoding. The left and right ends of each row are decoded versions
of two MNIST inputs. In between are the results of blending their latent
variables and then decoding.


The important thing is to see how the system blends from one endpoint
to the other. The 4 in the top row isn’t really turning itself into an
8, instead becoming a kind of hybrid of the 4 and 8 in the middle. The
intermediate 5’s look pretty good, and since they’re all recognizable as
5’s, those results are better than the version built entirely from
fully-connected layers. The 1 to 0 transition looks more like a
cross-dissolve than a single vertical stroke turning into a single
circular stroke.


25.7.2 Predicting from Novel Input 


Just for fun, let’s repeat our completely unfair test by giving the
low-resolution tiger to our convolutional autoencoder. The results are
shown in Figure 25.45.





Figure 25.45: The low-resolution tiger we applied to our convolutional 
autoencoder, and the result. It’s not very tiger-like. 


This is surprisingly good. If we squint, it looks like the major dark 
regions around the eyes, the sides of the mouth, and the nose, have 
been preserved. Maybe. 


As with our earlier autoencoder built from fully-connected layers, our 
convolutional autoencoder is trying to find a tiger somewhere in the 
latent space of digits. It seems to be having more success.


25.8 Denoising


A popular use of autoencoders is to remove noise from samples. 
A particularly interesting application is to remove the speckling 
that sometimes appears in computer-generated images [Bako17] 
[Chaitanya17]. These bright and dark points, which can look like static, 
or snow, can be produced when we generate an image quickly, without 
refining all the results. 


Let’s see how to use an autoencoder to remove bright and dark dots
in an image. We'll use the MNIST dataset again, but this time we'll add
some random noise to our images. At every pixel, we'll pick a value from
a Gaussian distribution and add it in, then clip the resulting values
to the range 0 to 1. Figure 25.46 shows some MNIST training images
with this random noise applied.


Figure 25.46: Top: MNIST training digits. Bottom: The same digits but
with random noise.
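
Here is a minimal sketch of that noise step in Python with NumPy. The standard deviation of 0.3 is our own guess, since the text doesn't state one, and the "clean" images are random stand-ins for MNIST digits scaled to the range 0 to 1. The function name add_gaussian_noise is ours.

```python
import numpy as np

def add_gaussian_noise(images, sigma, rng):
    """Add zero-mean Gaussian noise at every pixel, then clip to the range 0 to 1."""
    noisy = images + rng.normal(loc=0.0, scale=sigma, size=images.shape)
    return np.clip(noisy, 0.0, 1.0)

rng = np.random.default_rng(3)
# Stand-in for five 28 by 28 MNIST images (hypothetical pixel values).
clean = rng.uniform(0.0, 1.0, size=(5, 28, 28))

# sigma=0.3 is an assumed value chosen only for illustration.
noisy = add_gaussian_noise(clean, sigma=0.3, rng=rng)
```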










Our goal is to give our trained autoencoder the noisy versions of the
digits, as in the bottom row of Figure 25.46, and have it return
cleaned-up versions, as in the top row of Figure 25.46. So we'll teach
it using noisy-clean pairs, with the noisy image as the input and the
clean image as the output. We'll hope that the system will learn how to
encode each noisy image in such a way that when the latent variables
are decoded, we get back the clean image.


We'll use an autoencoder that alternates convolution layers with
upsampling and downsampling layers [Chollet17]. We begin with a
convolution layer of 32 filters with size 3 by 3, then follow it with a
downsampling layer with a 2 by 2 window. The output tensor thus has
a shape of 14 by 14 by 32. We'll repeat that again, so the tensor is now
a block that is 7 by 7 by 32. Now we'll apply convolution again, but now
follow it with an upsampling layer that doubles the width and height.
We'll do that again, so now our tensor is 28 by 28 by 32. We'll pass that
through a final convolution layer with just 1 filter. The architecture is
shown in Figure 25.47.


[Figure 25.47 layer labels, left to right: 32 x (3x3) ReLU; 2x2; 32 x (3x3) ReLU; 2x2; 32 x (3x3) ReLU; 2x2; 32 x (3x3) ReLU; 2x2; 1 x (3x3) sigmoid]


Figure 25.47: A denoising autoencoder [after Chollet17]. 


To train our autoencoder, we'll give it the noisy images in Figure 25.46
as inputs, and the clean, noise-free versions as the targets we want it 
to produce. We'll train with all 60,000 training images for 100 epochs. 


The tensor at the end of the encoding step in Figure 25.47 (that is, after
the second downsampling step) has size 7 by 7 by 32, for a total of 1568
numbers. So our “bottleneck” in this model is actually larger than the
input, which has only 28 by 28 = 784 numbers.


That would be bad if our goal was compression, but here we’re trying 
to remove noise. Minimizing the number of latent variables isn’t as 
much of a concern. 


How well does it perform? Figure 25.48 shows some of the noisy
inputs, and the autoencoder’s outputs. Remarkably, it cleaned up the
pixels that were too bright as well as those that were too dark, giving
us a great-looking result.


Figure 25.48: Top row: Digits with noise added. Bottom row: The same
digits de-noised by our model of Figure 25.47.


We discussed in Chapter 21 that explicit upsampling and downsampling
layers are falling out of favor, replaced by striding and transposed
convolution. Let’s simplify our model of Figure 25.47 to make Figure
25.49, which is now made up of nothing but a sequence of 5 convolution
layers. The first 4 use striding and fractional striding to replace
the downsampling and upsampling layers.


32 x (3x3) 32 x (3x3) 32 x (3x3) 32 x (3x3) 1 x (3x3) 
ReLU ReLU ReLU ReLU sigmoid 
stride (2,2) stride (2,2) repeat(2,2) repeat (2,2) 





Figure 25.49: The autoencoder of Figure 25.47 but using downsampling 
and upsampling inside the convolution layers. The “repeat” value refers 
to using fractional striding to repeat each input element, and thus make 
the input tensor wider and higher. 
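
To get a feel for what striding and repeating do to the size of a tensor, here is a small NumPy sketch. It deliberately leaves out the filtering itself: a real strided convolution also applies its 3 by 3 filters at each position it visits. The helper names stride_2 and repeat_2 are ours.

```python
import numpy as np

def stride_2(tensor):
    """Keep every second row and column, the positions a stride of (2, 2) visits."""
    return tensor[::2, ::2]

def repeat_2(tensor):
    """Repeat each element twice along both height and width."""
    return np.repeat(np.repeat(tensor, 2, axis=0), 2, axis=1)

grid = np.arange(16).reshape(4, 4)
smaller = stride_2(grid)     # shape (2, 2)
larger = repeat_2(smaller)   # shape (4, 4); each value fills a 2 by 2 block
```

Applied to our images, two strided steps take 28 by 28 down to 7 by 7, and two repeat steps bring it back up to 28 by 28.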


Figure 25.50 shows the results.


Figure 25.50: The results of our de-noising model of Figure 25.49.


The results are quite close, though there are small differences (look at
the bottom-left of the 0). The first model, Figure 25.47 with explicit
layers for upsampling and downsampling, took roughly 300 seconds
per iteration on a late 2014 iMac with no GPU support. The simpler
model of Figure 25.49 took only about 200 seconds per iteration, so it
shaved off about 1/3 of the training time.


It would take a more careful problem statement, testing, and review 
of the results to decide if either of these models was producing better 
results than the other for a given task. 


25.9 Variational Autoencoders 


The autoencoders we've seen so far have tried to find the most efficient 
way to compress an input so that it can later be recreated. They crunched
the input down to a few latent variables, or a code, which were then 
fed to the decoder to recover the input. 


A variational autoencoder (or VAE) shares the same general archi- 
tecture as those networks (though with some important differences), 
but it gives us a different capability when we're done. The resulting
model can be used as a generator to create unlimited amounts of
brand-new data that is like the input samples, but is not the same as 
any of them. 


If we could do this, then we could take the MNIST dataset, for example, 
and make ten thousand or ten million new images of digits, each one 
like the original MNIST images, but not a copy or just a tiny variation. 
Each one will be a brand-new image that is statistically like the MNIST 
images, but not actually any of them. 


As an analogy, suppose we were assigned to paint a picture of each 
tree in an orchard of cherry trees. After doing this long enough, we’d 
become very good at observing just what qualities make up a cherry 
tree, from its height, to the color and texture of its bark, to how it 
branches, and so on. We could then paint new trees out of our imagi- 
nation. They wouldn’t be simple variations on the trees we knew, and 
they wouldn't be collages of bits and pieces of those trees, stapled 
together like Frankenstein’s monster [Shelley18]. They'd be pictures 
of trees that look like the real trees, but are brand-new. We’d become a 
generator of pictures of cherry trees. 


We tried to turn our earlier autoencoders into generators by hand- 
ing random numbers to the decoder stages, but the results were just 
splotchy images, not pictures of digits. So to get better results, the VAE 
has to do something special, as we'll see. 


There’s an interesting quirk in a VAE that comes from how it’s built. 
Our previous autoencoders, once they were trained, were deterministic.
That is, the same input will always produce the same latent
variables, and those latent variables will always then produce the same 
output. But the VAE uses probabilistic ideas (that is, random num- 
bers) in the encoding stage, so that if we run the same input through 
the system many times, we'll get a slightly different output each time. 


As we look at the VAE, we'll continue to phrase our discussion in 
terms of images (and pixels) for concreteness, but like all of our other 
machine learning algorithms, a VAE can be applied to any kind of data: 
sound, weather, movie preferences, etc.


25.9.1 Distribution of Latent Variables


In our previous autoencoders we didn’t impose any conditions on the 
structure of the latent variables. We saw in Figure 25.29, repeated 
here in Figure 25.51, that for a 2D fully-connected system, the encoder 
seemed to naturally group the latent variables into blobs radiating to 
the right and upwards from a common starting point at (0,0). That
structure wasn’t a design goal. It just came out that way as a result of 
the nature of the network we built. 





Figure 25.51: Figure 25.29 repeated with samples drawn from both 
densely mixed and sparse regions. 


Here we have big white zones where no samples landed, and dense
zones where the samples are jumbled together. Figure 25.51 also shows
images from latent variables chosen from dense and sparse regions.
These empty and densely-mixed zones are a problem for us when we
want to generate new data using random latent variables. If we pick a
pair of latent variables far away from any saved data, the result can be
distorted (like the 3 on the right side of the figure), or if we pick a point
in the dense region we can get something confusing (like the smudgy 6
at the bottom of the figure).


We'd like to have all the digits clustered in their own regions without 
overlap, but also without any empty spaces. There’s no obvious way 
for us to modify the structure of the autoencoder to bring this about. 


There’s not much we can do about filling in empty zones, since those 
are places where we just don’t have input data. But we can try to break 
apart the mixed zones so that each digit occupies its own region of the 
plane. 


The variational autoencoder does a good job of meeting these goals. 


25.9.2 Variational Autoencoder Structure 


As so often happens with really good ideas, VAEs were invented
simultaneously but independently by at least two different groups
[Kingma14] [Rezende14]. Understanding the technique in detail requires
working through some math that can be challenging, even when it’s been
stripped down to its essence [Durr16].


Because we don’t want to get into the math here, we'll take an approx- 
imate and conceptual approach. This will necessarily mean that we’re 
going to leave out most details, and we'll approximate many others, 
but that’s okay because our intent is to capture the gist of the method 
rather than its precise mechanics. 


Recall that the big goal is to create a generator that will take in random 
latent variables, and produce outputs that are new and reasonably like 
the inputs. We'll get there by making the latent variables obey three 
guiding properties. First, they'll all be gathered into one region, so 
we know what the ranges should be for our random values. Second, 
latent variables produced by similar inputs (that is, images that show
the same digit) will be clumped together. Third, we’ll minimize gaps
so that we always have some data to draw from when we build each 
output. 


We'll accomplish these goals in two ways. First, we'll design an 
architecture for a learning model that is potentially capable of accom- 
plishing them. Second, we'll craft an error term that punishes the 
system when it makes latent samples that don’t follow the rules. Since 
the system is designed to always minimize the error, it will necessarily 
start producing latent samples that are structured the way we want. 
The architecture and the error term are designed to work together. 


Let’s first tackle the idea of keeping all latent variables together in one 
place. We'll do that by imposing a rule, or constraint. As we mentioned, 
if the latent values that come out of the encoder don’t satisfy our con- 
straint, we'll punish the network. 


Our constraint will be that the values for each latent variable, when 
plotted, will come close to forming a unit Gaussian distribution. Recall 
from Chapter 2 that a Gaussian is the famous “bell curve,” illustrated 
in Figure 25.52. 


Figure 25.52: A Gaussian curve. 


When we place two Gaussians on the plane, one on the X axis and one 
on the Y, we get a 2D bump above the plane, as in Figure 25.53.


Figure 25.53: If we have two dimensions, we can place a Gaussian on
each one. Together, they form a 3D bump over the plane. 


We can extend this to 3D by including another Gaussian on the Z axis. 
If we think of the size of the resulting bump as a density, then the 3D 
Gaussian is like a dandelion, which has a lot of stuff in the center but 
becomes sparser as we move away.


By analogy, we can imagine a Gaussian of any number of dimensions, 
just by saying that each dimension’s values follow a Gaussian curve 
on their axis. And that’s what we'll do here. We'll tell the VAE to learn 
values for the latent variables so that, when we look at the latent vari- 
ables for lots of training samples, and we count up how many times 
each value occurs, every variable’s counts will form a distribution like 
a Gaussian that has its mean (or center) at 0, and a standard devia- 
tion (that is, its spread) of 1, as in Figure 25.54. Recall from Chapter 2 
that this means that about 68% of the values we produce for this latent 
variable will fall between -1 and 1.


Figure 25.54: A Gaussian is described by its mean (the location of its center),
and its standard deviation (the symmetrical distance that contains about
68% of its area). Here we have a center of 0, and a standard deviation of 1.


When we're done training, we'll know that our latent variables will be
distributed according to this pattern. So if we pick new values to feed 
to the decoder, and we select them from this distribution (that is, we’re 
more likely to pick each value near its center and within the bulk of its 
bump rather than off to the edges), we'll be likely to generate sets of 
latent values that will be near values we learned from the training set, 
and thus we can create an output that will also be like the training set. 
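
Both ideas are easy to try with NumPy's random number generator. This sketch first draws many values for a single latent variable from a unit Gaussian and measures how many land between -1 and 1, then draws one fresh set of 20 latent values of the kind we would hand to a trained decoder. The count of 20 simply matches our earlier deep autoencoder, and no decoder is included here.

```python
import numpy as np

rng = np.random.default_rng(4)

# Many draws for one latent variable: mean 0, standard deviation 1.
draws = rng.standard_normal(100_000)
fraction_within_one = np.mean(np.abs(draws) <= 1.0)   # close to 0.68

# One new set of 20 latent values, chosen from the same distribution.
new_latents = rng.standard_normal(20)
```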


This naturally also keeps the samples together in the same area, since 
they’re all trying to match a Gaussian distribution with a center of 0.


Getting the latent variables to fall within unit Gaussians is an ideal 
we rarely achieve. There’s a tradeoff between how well the variables 
match Gaussians, and how accurate the system can be in recreating 
inputs [Frans16]. The system automatically learns that tradeoff during
the training session, balancing off the differences between the input 
images and the generated outputs, and the structure of the latent 
variables. 


Getting the latent values of all images with the same digits to clump 
together comes from a clever trick that uses some randomness. It’s a 
little bit subtle. 


Let’s start by assuming that we’ve already achieved this goal. We'll 
see what this implies from a particular point of view, and that will tell 
us how to actually bring it about. 


In other words, we’re assuming that every set of latent variables for 
an image of, say, the digit 2 will be near every other set of latent vari- 
ables for images of the digit 2. We'll do even better, though. Some 2’s 
have a loop in the lower-left corner. So in addition to having all the 2’s 
clumped together, we'll keep all the 2’s with loops together and all the 
2’s without loops together, and the region between those clumps will 
be filled with the latent variables of 2’s that sort-of have a loop, as in 
Figure 25.55.


Figure 25.55: A grid of 2’s organized so that neighbors are all like one
another. 


Now let’s carry this idea to its limit. Whatever the shape and style and 
line thickness and tilt and so on of every image that’s labeled a 2, its 
latent variables will be near those of other images labeled 2 that
show about the same shape and style. So we'll gather together all the 
2’s with a loop and all those without, all those drawn with straight 
lines and all those drawn with curves, all those with a thick stroke and 
all those with a thin one, all the 2’s that are tall, and so on. That’s the 
major value of using lots of latent variables: they let us clump together 
all of the different combinations of these features, which wouldn’t be 
possible in just 2 dimensions. So in one place we'll have thin straight 
no-loop 2’s, another region will be thick curved no-loop 2’s, and so on, 
for every combination. 


If we had to identify all of these features ourselves this scheme wouldn’t 
be very practical. But a VAE will not only learn the different features, it 
will automatically create all the different groupings for us as it learns. 
As usual, we just feed in images and the system does all the rest of the 
work.


This “nearness” criterion is measured in the imaginary space that has
one dimension for each latent variable. In 2 dimensions, each set of 
latent variables creates a point on the plane, and their distance (or 
“nearness”) is the length of the line between them. We can generalize 
this idea to any number of dimensions, so we can always find the dis- 
tance between two sets of latent variables, even if each one has 30 or 
even 300 values. 
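
The same distance calculation works for any number of latent variables, as this short sketch shows. It uses the ordinary straight-line (Euclidean) distance; the function name latent_distance and the example values are ours.

```python
import numpy as np

def latent_distance(z_a, z_b):
    """Straight-line distance between two sets of latent variables."""
    diff = np.asarray(z_a, dtype=float) - np.asarray(z_b, dtype=float)
    return float(np.sqrt(np.sum(diff ** 2)))

# Two points on the plane: the familiar 3-4-5 triangle.
d_plane = latent_distance([0.0, 0.0], [3.0, 4.0])

# Two sets of 300 latent values each, handled by the very same function.
d_300 = latent_distance(np.zeros(300), np.ones(300))
```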


Suppose that our system is given an input of an image of the digit 2, 
and the encoder finds the latent variables for it. Before we hand these 
to the decoder to produce an output image, let’s add a little random- 
ness to each of the latent variables, and pass those modified values to 
the decoder, as in Figure 25.56. 


[Diagram: the encoder produces latent variables, a random number is added to each one, and the sums are passed to the decoder.]

Figure 25.56: One way to add randomness to the output of the encoder 
is to add a random value to each latent variable before passing them to 
the decoder. 


Because we’re assuming that all of the examples of the same style are 
clumped together, the 2 we generate from the perturbed latent vari- 
ables will be similar to our input, and thus the output image will be like 
the input image, and the error that measures the difference between 
the images will be low. Then we can make lots of new 2’s that are like 
the input 2, just by adding different random numbers to the same set 
of latent values. 


That’s how it works once the clumping has already been done.


To bring this about, all we have to do is implement the process and 
then give the network a big error score when the output doesn’t come 
very close to matching the input. The system, trying to minimize that 
error, will change how it computes the latent variables in the encoder 
(and how it uses them in the decoder) so that they behave this way. 


But we took a shortcut just now that we can’t follow in practice. If 
we just added random numbers as in Figure 25.56, we wouldn’t be
able to use the backpropagation algorithm we saw in Chapter 18 to
train the model. The problem comes about because backpropagation
needs to compute the gradients flowing through the network. But the
mathematics of an operation like Figure 25.56 don’t let us calculate
the gradients the way we need to. And without backpropagation, our 
whole learning process disappears in a puff of smoke. 


So VAEs use a clever idea to get around this problem, replacing the 
random-number adding with a similar idea that does about the same 
job, but which lets us compute the gradient. It’s a little bit of math- 
ematical substitution that lets backpropagation work again. This is 
called the reparameterization trick. As we've seen a few times, the 
word “trick” is sometimes used to refer to a particularly clever type of 
mathematical manipulation. 


It’s worth knowing about this trick because it often comes up when 
reading about VAEs (there are other mathematical tricks involved, but 
we won’t go into them). 


The trick is this: instead of just picking a random number from thin air 
for each latent variable and adding it in, we select a random value
from a probability distribution. That value now becomes our latent 
variable [Doersch16]. Recall from Chapter 2 that a probability distri- 
bution can power a routine that produces random numbers each time 
we ask it, but the numbers are drawn so that some are more likely than 
others. In this case we'll use a Gaussian again. This means when we 
ask for a random value, we’re most likely to get a number near zero
(where the bump is high), and less and less likely to get numbers that
are farther away from 0.


Since each Gaussian requires a center (the mean) and a spread (the 
standard deviation), the encoder will produce this pair of numbers 
for each latent variable. So if our system has 8 latent variables, then the
encoder will produce 8 pairs of numbers: the center and spread for a
Gaussian distribution for each one. For each latent variable we pick a 
random number from its distribution, and that’s the value of the latent 
variable that we then give to the decoder. In other words, rather than 
add a random value to the existing latent variable, we pick a new value 
for the latent variable that’s pretty close to where it was, but has some 
randomness built in. These two approaches are similar, but only the 
latter one lets us apply backpropagation to the network. 
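
As a rough sketch, the sampling step might look like the following in Python. The names `sample_latents`, `centers`, and `spreads` are ours, and the numbers are invented. We assume the usual way of writing the trick: each latent value is the center plus the spread times a random number drawn from a fixed Gaussian with center 0 and spread 1. All the randomness sits in that fixed noise, which is what lets the centers and spreads be adjusted by backpropagation.

```python
import random

def sample_latents(centers, spreads, rng):
    """Pick one value per latent variable from a Gaussian with that center and spread.

    Each value is computed as center + spread * noise, where the noise comes
    from a fixed Gaussian with center 0 and spread 1.
    """
    if len(centers) != len(spreads):
        raise ValueError("need one spread for each center")
    return [c + s * rng.gauss(0.0, 1.0) for c, s in zip(centers, spreads)]

rng = random.Random(42)
centers = [0.5, -1.2, 0.0, 2.0, 0.3, -0.7, 1.1, 0.0]   # 8 latent variables
spreads = [0.1] * 8                                    # a small spread for each one

# Same centers and spreads, but a fresh set of latent values on every call.
first = sample_latents(centers, spreads, rng)
second = sample_latents(centers, spreads, rng)
```

With a spread of 0 the randomness vanishes and every latent value is just its center, which is how an ordinary autoencoder behaves.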


Figure 25.57 shows the idea. 


Figure 25.57: In a VAE, each output of the encoder is then passed to two
independent layers. One estimates the center of a Gaussian curve that we 
imagine that value came from, and the other estimates that curve’s stan- 
dard deviation, or spread. Then we choose a random value treating that 
Gaussian as a probability distribution, and that value is what we then call 
the latent value, presented at the output of the encoder. This is repeated 
for each latent value. 


To put this into our autoencoder network, after the encoder has created
a value for each latent variable, we split the network. We make
one layer to compute the center of the Gaussian, and one to compute
the spread. Then we make another layer that combines the data from
these two layers, using the center and spread for each random variable 
to pick a single random value, and that is the latent value that comes 
out of the encoder. Figure 25.58 shows the idea. 


Figure 25.58: Picturing the split-and-combine sampling step of a VAE
for 3 latent values. The encoder produces 3 values, and for each one we
compute a center and spread. Those 3 different Gaussian bumps are then
randomly sampled, and those selected values are the latent values
produced as the output of the encoder.


This operation is why we said earlier that each time we send a sample
into a VAE, we get back a slightly different result. The encoder is
deterministic up to and including the split. That is, we'll always get the same
centers and spreads for each latent variable. But then that last stage
picks a random value for each latent variable from those Gaussians,
and those will be different each time.


25.10 Exploring the VAE


Figure 25.59 shows the architecture of the VAE we'll be using here. 
It’s just like our deep autoencoder built from fully-connected layers of 
Figure 25.21, but with two changes. 





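[Diagram: encoder layers of 512 and 256 neurons (ReLU), a split into Centers and Spreads feeding a "Pick from Gaussians" step, then decoder layers of 256 and 512 neurons (ReLU) and 784 neurons (sigmoid).]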





Figure 25.59: The architecture of our VAE for MNIST data. There are 20 
latent values. 


The first change is that we now have the randomizing step at the
end of the encoder. The second change is that we’re using a different 
loss, or error, function, though that’s not shown in the block diagram. 


In addition to the job of our previous loss functions, which measured 
the distance of each output from its input, our new loss function also 
measures how unlike the encoding and decoding stages are. After all, 
whatever the encoding stage is doing, we want the decoding stage to 
undo it. We measure this with the Kullback-Leibler (or KL) divergence 
that we saw in Chapter 6. Recall that this measures the error we get 
from compressing information using an encoding that is different from 
the optimal one. In this case, the optimal encoder is the opposite of the 
decoder, and vice-versa. So the big picture is that as the network tries 
to decrease the error, it is therefore decreasing the difference between
the encoder and decoder, bringing them closer to mirroring each other 
[Altosaar16]. 
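
We haven't given a formula for this loss, so the sketch below rests on assumptions. It uses a sum of squared pixel differences for the input-versus-output part, and a form of the KL term that is widely used in VAE implementations: the divergence between each latent variable's Gaussian and a unit Gaussian (center 0, spread 1). All of the function names are ours.

```python
import math

def reconstruction_error(inputs, outputs):
    """Sum of squared differences between input pixels and output pixels."""
    return sum((x - y) ** 2 for x, y in zip(inputs, outputs))

def kl_to_unit_gaussian(centers, spreads):
    """KL divergence from each latent Gaussian to a unit Gaussian, summed.

    Assumes every spread is greater than 0.
    """
    return sum(0.5 * (c * c + s * s - 1.0) - math.log(s)
               for c, s in zip(centers, spreads))

def vae_loss(inputs, outputs, centers, spreads):
    """Total error: how far the output is from the input, plus the KL term."""
    return reconstruction_error(inputs, outputs) + kl_to_unit_gaussian(centers, spreads)

# The KL term is zero when a latent Gaussian already is a unit Gaussian,
# and it grows as the center drifts away from 0.
print(kl_to_unit_gaussian([0.0], [1.0]))   # 0.0
print(kl_to_unit_gaussian([1.0], [1.0]))   # 0.5
```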


Let’s see what comes out of this VAE for some of our MNIST samples. 
Figure 25.60 shows the result. 


Figure 25.60: Predictions from our VAE of Figure 25.59. Top row: Input
MNIST data. Bottom row: Output of the variational autoencoder. 


It’s no surprise that these are pretty good matches. Our network is 
using a lot of compute power to make these images! But we saw above 
that the VAE will produce different outputs each time it sees the same 
image. Let’s take the 2 from this test set and run it through the VAE 8 
times. The results are in Figure 25.61. 


Figure 25.61: The VAE produces a different result each time it sees the
same input. Top row: the input image. Middle row: The output from the 
VAE after handing it the input 8 times. Bottom row: the pixel by pixel 
differences between the input and each output. Increasingly bright red 
means larger positive differences, increasingly bright blue means larger 
negative differences. 


These 8 results from the VAE are similar to each other, but we can see 
obvious differences. Notice the changes in the lower right, near the 
end of the stroke. 


Let’s go back to our original set of 5 images, but add additional noise 
to the latent variables that come out of the encoder, as we did in ear- 
lier sections. This will give us a good test of how clumped-together 
the training images become in the space of the latent variables. Recall 
that adding noise to the latent variables in our previous autoencoders 
ruined the output pretty quickly. 


Let’s try adding a random value that’s up to 10% of each latent vari- 
able’s amount. Figure 25.62 shows the result of adding this moderate 
amount of noise to the latent variables. 


Figure 25.62: Adding 10% noise to the latent variables coming out of
the VAE encoder. Top row: Input images from MNIST. Bottom row: the 
decoder output after adding noise to the latent variables produced by 
the encoder. 
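
One way to read "up to 10%" is sketched below. We assume the random amount is spread evenly between minus 10% and plus 10% of each latent value. That is our choice for illustration, as are the name `perturb_latents` and the sample numbers.

```python
import random

def perturb_latents(latents, fraction, rng):
    """Add to each latent value a random amount of up to +/- fraction of that value."""
    return [v + v * rng.uniform(-fraction, fraction) for v in latents]

rng = random.Random(7)
latents = [1.5, -0.4, 2.0, 0.0]

noisy_10 = perturb_latents(latents, 0.10, rng)   # moderate noise
noisy_30 = perturb_latents(latents, 0.30, rng)   # a lot of noise
```

Note that with this scheme a latent value of 0 is never changed, because the noise is scaled by the value itself.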


Adding noise doesn’t seem to change the images much at all. Let’s 
crank up the noise, adding in a random number as much as 30% of the 
latent variable’s value. Figure 25.63 shows the result. 


Figure 25.63: Perturbing the latent variables by up to 30%. Top row: The
MNIST input images. Bottom row: The results from the VAE decoder. 


Even with a lot of noise, the images still look like digits. Even when the 
7 changes significantly, it changes into a bent 7, not a random splotch. 
This is a far better result than we got from adding noise in our previ- 
ous autoencoders. 


Let’s try blending the parameters for our digits and see how that looks.


Figure 25.64 shows the equal blends for the 5 pairs of digits we’ve seen 
before. 


Figure 25.64: Blending latent variables in the VAE. Top and middle row:
MNIST input images. Bottom row: An equal blend of the latent variables 
for each image, decoded. 


The interesting thing here is that these are all looking more-or-less like
digits (the leftmost image is the worst in terms of being a digit, but it’s
still a coherent shape).


Let’s look at some linear blends. Figure 25.65 shows the intermediate 
steps for the 3 pairs of digits we’ve seen before. 


Figure 25.65: Linear interpolation of the latent variables in a VAE. The
leftmost and rightmost images in each row are the output of the VAE for
a specific input. The images in between are decoded versions of the 
blended latent variables. 
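
The blending itself is simple arithmetic on the latent variables. A minimal sketch, with function names and sample values of our own invention: a blend amount `t` of 0 gives the first set of latent values, 1 gives the second, and 0.5 gives the equal blend we used a moment ago. Each blended set would then be handed to the decoder.

```python
def blend_latents(a, b, t):
    """Linear blend of two sets of latent values: t=0 gives a, t=1 gives b."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]

def interpolation_steps(a, b, count):
    """Return `count` evenly spaced blends from a to b, including both ends."""
    if count < 2:
        raise ValueError("need at least the two end points")
    return [blend_latents(a, b, i / (count - 1)) for i in range(count)]

start = [0.0, 2.0]    # latent values for one image (made-up numbers)
end = [4.0, -2.0]     # latent values for another image

steps = interpolation_steps(start, end, 5)
print(steps[2])   # the equal blend: [2.0, 0.0]
```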







The 5 is looking great, moving through a space of 5’s from one version 
to another. The top and bottom rows have plenty of images that aren’t 
digits. They’re like the shape blends that we saw in earlier autoencoders. 


Let’s run our tiger through the system just for fun. Remember, this is 
a completely unfair thing to do, and we shouldn’t expect anything meaningful
to come out. Figure 25.66 shows the result. 





Figure 25.66: Running our low-resolution tiger through the VAE. 


The VAE seemed to think the tiger was some kind of loop. The inter- 
esting thing is that the result isn’t just random blobs like splattered 
paint, but something that has a coherent digit-like structure, even if it 
isn’t an actual digit. 


All of these experiments suggest that our latent variables are clumping
up the way we wanted: if we pick a point in the latent space, we get 
back something that’s digit-like, even if it’s not a digit. 


We re-trained our VAE with just 2 latent variables (rather than the 20 
we've been using), and plotted them in Figure 25.67.


Figure 25.67: The placement of latent variables for 10,000 MNIST images
from our VAE trained with 2 latent variables. 


This is a great result. The Gaussian bump they’re trying to stay within
is represented here by a black circle, and it seems pretty well populated. 
The various digits are generally well clumped. There’s some confusion 
in the middle, probably where the oddly-drawn digits fell. Remember 
that this image uses just 2 latent variables, so we would expect things 
to be better when there are more dimensions to work with. 


Let’s make a grid of images that correspond to our two latent variables. 
Still working with just two latent variables, we'll take the (x,y) values 
of each point and feed them to the decoder as though they were latent 
variables. Our range will be —3 to 3 on each axis, like the circle in Figure 
25.67. In other words, we'll imagine a grid in Figure 25.67 from —3 to 3 
in each axis, and for each point in that grid, we'll feed the correspond- 
ing latent variables into the decoder. The result is Figure 25.68. 


Figure 25.68: The output of the VAE treating each (x,y) value as the two
latent variables. Each axis runs from -3 to 3. 
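
Building the grid of latent values is easy to sketch. The function name `latent_grid`, the grid size of 7, and the choice to put the largest y in the top row are all assumptions for illustration. The `decoder` mentioned in the final comment is a stand-in for a trained decoder, which we don't have here.

```python
def latent_grid(low, high, steps):
    """Rows of (x, y) pairs evenly covering the square from low to high on each axis."""
    if steps < 2:
        raise ValueError("need at least two steps per axis")
    ticks = [low + (high - low) * i / (steps - 1) for i in range(steps)]
    # The top row of the picture gets the largest y value.
    return [[(x, y) for x in ticks] for y in reversed(ticks)]

grid = latent_grid(-3.0, 3.0, 7)
print(grid[0][0])     # top-left corner: (-3.0, 3.0)
print(grid[-1][-1])   # bottom-right corner: (3.0, -3.0)

# With a trained decoder in hand, each pair becomes one small image in the grid:
# images = [[decoder([x, y]) for (x, y) in row] for row in grid]
```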


We can see how nicely the digits have been clumped together. There 
are a few places the images get fuzzy, but even with just 2 latent vari- 
ables most of the images are digit-like. 


Remember that one of our goals was to be able to produce new data 
that was like the inputs, so let’s try that out. As we did for our previ- 
ous autoencoders, we'll feed our VAE purely random latent variables. 
In other words, we'll remove the encoder from the network, pick random
latent variables from Gaussian distributions, and decode them,
as shown in Figure 25.69. 


[Diagram: the decoder alone, with layers of 256 and 512 neurons (ReLU) and 784 neurons (sigmoid).]


Figure 25.69: To use our VAE as a generator, we just feed sets of 20 
random numbers into the decoder stage. 


We'll use the version of the VAE with just 2 latent variables, so we’re
drawing from the same VAE that produced Figure 25.68. Figure 25.70
shows some of the results.


Figure 25.70: Images produced by the VAE decoder when presented with
random latent variables. These are not hand-picked. We just created 80 
random sets of 2 latent variables, decoded them, and saved the image. 


These are looking pretty great for the most part. There are some odd 
shapes and some fuzziness, but overall most of the images are recog- 
nizable digits. Many of the mushiest shapes seem to have come from 
the boundary between the 8’s and the 1’s, leading to narrow and thin 
8’s. 


Most of these digits are fuzzy, because we’re using only two latent 
variables. Let’s sharpen things up by training and then using a deeper 
VAE with more latent variables. Figure 25.71 shows our new VAE. This 
architecture is based on the MLP autoencoder that’s part of the Caffe 
machine-learning library [Jia16] [Donahue15] (recall that MLP stands
for Multi-Layer Perceptron, or a network built only out of fully-connected
layers).













































[Diagram: layers of 1000, 500, 250, 50, 250, 500, and 1000 neurons (all ReLU) followed by 784 neurons (sigmoid), with the Centers, Spreads, and "Pick from Gaussians" step in the middle.]





Figure 25.71: The architecture of a deeper VAE based on the autoencoder 
provided with Caffe. 


We trained this system with 50 latent variables for 25 epochs, and then 
generated another grid of random images. As before, we used just the 
decoder stages as in Figure 25.72. 


[Diagram: the decoder alone, with layers of 250, 500, and 1000 neurons (ReLU) followed by 784 neurons (sigmoid).]


Figure 25.72: We'll generate new output using just the decoder stage of 
our deeper VAE, feeding in 50 random numbers to produce images. 


The results are in Figure 25.73. 


Figure 25.73: Images produced by our bigger VAE when provided with
entirely random values. 


These images have significantly crisper edges than the images in Figure 
25.68. For the most part, we’ve generated entirely recognizable and 
plausible digits from purely random latent variables, though as usual 
there are some weird images that aren’t much like digits. These are 
probably coming from the zones between digits, so we’re getting odd- 
ball blends of the shapes. 


Once our VAE has been trained, we can drop the encoder and just save 
the decoder. This is now a generator that we can use to create as many 
new digit-like images as we like. 
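
To make the idea of a stand-alone generator concrete, here is a sketch of a decoder with the layer sizes of Figure 25.72, written in plain Python. We don't have trained weights to show, so the weights below are random stand-ins, and the "image" that comes out is meaningless noise rather than a digit. Every name here is ours. The point is only the flow: 50 random numbers go in, and 784 pixel values (a 28 by 28 image) come out.

```python
import math
import random

def make_layer(n_in, n_out, rng):
    """Random (untrained) weights and zero biases for one fully-connected layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def apply_layer(layer, inputs, activation):
    """Weighted sum plus bias for each neuron, followed by the activation function."""
    weights, biases = layer
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def relu(v):
    return max(0.0, v)

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def generate_image(decoder_layers, n_latent, rng):
    """Draw random latent values from a unit Gaussian and run them through the decoder."""
    values = [rng.gauss(0.0, 1.0) for _ in range(n_latent)]
    for layer in decoder_layers[:-1]:
        values = apply_layer(layer, values, relu)
    return apply_layer(decoder_layers[-1], values, sigmoid)

rng = random.Random(0)
sizes = [50, 250, 500, 1000, 784]   # latent size, then the decoder's layer sizes
decoder = [make_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

image = generate_image(decoder, n_latent=50, rng=rng)
print(len(image))   # 784
```

Calling `generate_image` again produces a different set of random latent values, and so a different output, as many times as we like.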


Reinforcement
Learning 


Another way to teach a system how 

to perform as we like is to reward it when 

it does well. We'll see how to take this principle 
and turn it into a working algorithm for learning. 


26.1 Why This Chapter Is Here


There are lots of ways to train a machine learning system. When we 
have a set of labeled samples, we can use supervised learning to teach 
the computer to predict the right label for each sample. When we can’t 
offer any feedback, we can use unsupervised learning and let the com- 
puter do its best. 


But sometimes we’re somewhere in between these two extremes. 
Perhaps we know something about what we want the system to learn, 
but it’s not as clear-cut as having labels for samples. Perhaps all we 
know is how to tell a better solution from a worse one. 


For example, we might be trying to teach a new kind of humanoid 
robot how to walk on two legs. We don’t know exactly how it ought to 
balance, and how it should move, but we know we want it to be upright 
and not falling over. If the robot tries to slither on its belly, or hop on 
one leg, we can tell it that’s not the right way to proceed. If it puts both 
legs on the ground and then uses them to make some forward prog- 
ress, we can tell it that it’s on the right track and keep exploring those 
kinds of behaviors. 


This strategy of rewarding what we recognize as progress is called 
reinforcement learning [Sutton17]. 


The term describes a general approach to learning, rather than a spe- 
cific algorithm. 


The general idea is that an agent (or actor) takes actions in an envi- 
ronment. The environment then sends feedback back to the agent, 
describing how “good” we believe those actions to have been, using 
whatever criteria we like. We also give the agent the new state of the 
environment, which may have changed as a result of those actions. 


Through trial and error, the agent can discover which actions are bet- 
ter than others in a given situation, and can then choose one of the 
good actions if the same situation comes up again.


26.2 Goals


The reinforcement learning approach works particularly well for situa- 
tions where we don’t already know the best thing to do at all times. For 
example, consider the problem of scheduling the elevators in a tall and 
busy office building. 


Even just figuring out where elevator cars ought to go when they’re
empty is hard. Should the cars always return to the ground floor?
Should some wait at the top? Should they wait at floors evenly distributed
between the top and bottom? Maybe these policies should change
over time, so in early morning and just after lunch the cars should be
on the ground floor, waiting for people arriving off the street, but in the
late afternoon they should be higher up, ready to help people descend
and head home.


There’s no obvious answer to how we should schedule a particular 
building. It all depends on the average traffic pattern for that building 
(and that pattern itself might depend on the time, season, or weather). 


This is an ideal problem for reinforcement learning. The elevator’s control
system can try a policy for directing the empty cars, and then use
feedback from the environment (such as the number of people waiting
for elevators, their average waiting time, the density of the elevator
cars, etc.) to help it adjust its policy to perform as well as it can on the
metrics we’re measuring.


Another nice example is how to best plant crops to help them grow. 
When should we plant the seeds? How deeply? How far apart? What’s 
the best watering schedule? Although it may take a lot of time to find 
the ideal strategy, the physical environment eventually gives us feed- 
back on our choices. We can use that feedback to repeatedly improve 
all of our decisions. 


Reinforcement learning can help us with problems where we don’t
know what the best result is. We may not have anything like “winning 
conditions” as in a game, but only better and worse outcomes. 


This is a key point: there may not be any objective, consistent “right” or
“best” answer to be found. Instead, we’re trying to find the best answer
we can with the information we have, according to whatever metrics
we’re measuring by. We don’t seek the “right” answer, because there
may not be one. We just seek the best one for now.


In some situations, we may not even have any idea how well we’re 
doing along the way. For example, in a complex game we might not be 
able to tell if we’re ahead or behind, until the surprising moment when 
we win or lose. In those cases, we can only evaluate our actions in light 
of how things finally worked out when the task is done. 


Reinforcement learning offers a nice way to model uncertainty. In 
simple rule-based games, we can, in principle, evaluate any board and 
select the best move, assuming that the other player will always do the 
same. But in the real world, other players make moves that surprise 
us. And when we deal with the real world, where on some days more 
people need an elevator than on other days, or the rains don’t come for 
our crops, or they do come but they bring too much water, we need to 
have strategies that can continue to perform well in the face of such 
surprises. Reinforcement learning can be a good choice for these kinds 
of situations. 


Let’s look at reinforcement learning in more detail with a specific 
example. 


26.2.1 Learning A New Game 


Suppose we want to teach a friend to play tic-tac-toe (also called 
noughts and crosses, or Xs and Os). To play, the players alternate
placing an X or O in the cells of a 3 by 3 grid, and the first to get three 
of their symbols in a row (in any direction) is the winner. We'll play O, 
and our friend will play X. 


Figure 26.1 shows a series of boards in a typical game.


[Diagram: a sequence of tic-tac-toe boards showing the game move by move.]


Figure 26.1: A game of tic-tac-toe, reading left to right. We start with an 
empty board. X goes first, then O replies, then X replies, and so on, until 
X wins with three in a row along the diagonal. 


We said that we want to teach our friend this game, but we didn’t say 
that she asked to be taught. The idea might have never even come up 
between us. So the first unusual thing in our strategy of teaching this 
game to our friend is that we won’t tell her anything about what’s
going on. 


Remember that in this metaphor, we won’t tell her that we’re playing 
a game. Perhaps she thinks that she’s helping us place items on store 
shelves, or design an attractive floor-tile pattern. Or perhaps she’s 
completely confused, but she’s willing to go along out of friendship 
and curiosity. 


We won't tell her how she might win or lose this game. And we cer- 
tainly won’t tell her how to play. Our friend is left almost completely in 
the dark about everything that’s going on. 


The key word here is “almost.” Since our friend is playing the role of 
the agent in this reinforcement learning analogy, we'll think of our- 
selves as the environment. As the environment, we'll give our friend 
four important pieces of information. 


First, we'll give her the current board, as shown in step 1 of Figure 26.2. 


Figure 26.2: The basic information exchange loop between a player and
the environment in a game of tic-tac-toe. In this example, the environment
and the player alternate turns, each waiting for the other. 1) The
agent is shown the current board. 2) The agent is given a list of possible
moves. 3) The agent selects a move. 4) The environment processes that
move, and makes a countering move. 5) The environment sends a reward
signal to the agent.


Second, we'll provide her with a list of possible moves, as in step 2
of Figure 26.2. In this case, we'll give her a list of the 7 empty cells in
which she can place an X.


Alternatively, we could give her all 9 cells as possible moves, including
cells that already have an X or O in them. Those are illegal moves, but 
she doesn’t know that. The strategy behind offering the agent illegal 
moves is that we’re hoping that the agent will eventually learn how 
to spot those illegal moves on her own. The thinking is that the more 
the agent knows about the game, including what not to do, the better 
she'll be able to play. In this example, the agent’s life is a little easier 
since she can’t make an illegal move. 


Once our friend has the board and the list of moves, she chooses one 
of those moves any way she likes. This is step 3 of Figure 26.2. Note 
that she doesn’t actually draw an X in the cell she’s picked. She just 
indicates which move she wishes us (the environment) to perform on 
her behalf. 


After she tells us which move she’s chosen, we execute her choice. If 
it’s an illegal move, we'll tell her so in a moment. Otherwise, we check 
for victory. If she’s just won the game, we'll tell her that. If neither of 
these things is true, we'll make our own move in response. This
processing is shown in step 4 of Figure 26.2.


Now we can give her the third piece of information: feedback in a 
reward signal. This is a single number, let’s say between 0 and 1. It 
tells our friend how “good” that action was, according to mysterious 
rules that she knows nothing about. This is shown in step 5 of Figure 
26.2. If she made an illegal move, the feedback would be 0. If she loses
the game entirely, the feedback would again be 0. But if she wins the
game, that winning move would get a reward of 1. Usually the feedback 
is a value between those extremes, expressing how “good” we think 
that move was. The better the move, the larger the feedback value. 


We'll make this reward signal something that our friend likes, so that 
she will want to get the biggest and most frequent rewards she can. 


We're showing this feedback as arriving after every move, but some- 
times we don’t know how well we’re doing until the end. In that case, 
the only feedback signal comes when the game is over. We'll see an 
example of that approach later in the chapter. 


The fourth piece of information that we tell her is not shown in the
figure, because it’s general guidance: she should strive for the biggest
rewards possible. A reward of 0 is terrible, while a reward of 1 is the
very best and means that she has successfully done whatever it is we’re 
hoping she'll do (in this case, get 3 X’s in a row). A reward of 1 is some- 
times called the ultimate reward or final reward. 


To recap, we present our friend with a board and some moves. She 
chooses a move, which we implement and then respond to. We give 
her back a reward signal to tell her how good her choice of move was. 
When it’s time for her to move again, we present her with the new 
board and a new list of moves, and the process repeats. 
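
The loop we just recapped can be sketched in a few lines of Python. Several things here are assumptions made only to get something that runs: the environment replies with a random O, only legal moves are offered, and every move that doesn't end the game earns a placeholder reward of 0.5 (a draw gets the same). The function names are ours.

```python
import random

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals

def has_won(board, mark):
    return any(all(board[i] == mark for i in line) for line in WIN_LINES)

def empty_cells(board):
    return [i for i, cell in enumerate(board) if cell == " "]

def environment_step(board, move, rng):
    """Steps 4 and 5: place the agent's X, reply with an O, return (reward, done)."""
    board[move] = "X"
    if has_won(board, "X"):
        return 1.0, True      # the winning move earns the top reward
    if not empty_cells(board):
        return 0.5, True      # a draw; 0.5 is an arbitrary choice for this sketch
    board[rng.choice(empty_cells(board))] = "O"   # the environment's countering move
    if has_won(board, "O"):
        return 0.0, True      # the agent lost
    return 0.5, False         # placeholder "somewhere in between" reward

def play_episode(choose_move, rng):
    """Run one game from an empty board and collect the reward for each agent move."""
    board = [" "] * 9
    rewards = []
    done = False
    while not done:
        moves = empty_cells(board)               # step 2: the list of possible moves
        move = choose_move(list(board), moves)   # steps 1 and 3: see the board, pick a move
        reward, done = environment_step(board, move, rng)
        rewards.append(reward)                   # step 5: the reward signal
    return board, rewards

rng = random.Random(1)

def random_agent(board, moves):
    """An agent that knows nothing yet, so it picks any offered move."""
    return rng.choice(moves)

final_board, rewards = play_episode(random_agent, rng)
```

The agent function never touches the real board. It only receives a copy and names a move, just as our friend only tells us which cell she has picked.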


Sometimes we imagine the loop slightly differently, and think of it as 
starting with step 2, followed by steps 3, 4, 5, and then 1. The flow 
of information is unchanged, but conceptually this lets us think of 
the new board as part of the feedback, so the agent can contemplate 
it along with the reward signal. This version of the loop is shown in 
Figure 26.3. 


Figure 26.3: Another way to think about the flow of information in one
loop of information exchange. We start with step 2 from Figure 26.2 and 
proceed with steps 3, 4, 5, and then 1. This loop picks up assuming the 
same starting board as in step 1 in Figure 26.2, where the top-right and 
center-left cells are occupied. 


There’s no practical difference in these two ways of thinking of the loop,
because the flow of information back and forth is the same. It’s just a
conceptual choice based on whether we prefer to think of the board as
related to the list of actions as in Figure 26.2, or related to the reward
signal, as in Figure 26.3.


Our friend, playing as the agent, has complete freedom in how she
chooses to interpret the reward signal and how she chooses an action. 
She’s also free to maintain any private information she likes. We'll 
assume that she has accepted our suggestion that she strive for the 
best rewards. 


How should she proceed? 


Below we'll see a possible strategy that almost works well, but falls 
short. Then we'll tune it up a little to make it work much better. 


Before we dig in, recall our discussion of operant conditioning in 
Chapter 11. This training methodology, based on behaviorism, uses 
rewards and punishments to encourage or discourage more of a particular
behavior.


Reinforcement learning fits naturally into that framework. 


26.3 The Structure of RL 


Let’s generalize our example above into a more abstract description. 
This will let us embrace situations that go beyond turn-taking games, 
such as our previous examples of elevator control and growing crops. 


Figure 26.4 shows this more generalized view of reinforcement learn- 
ing, organized into three steps. We will look at each of these steps in 
detail below. 


Figure 26.4: An overview of reinforcement learning, organized into
three steps. Step 1: The agent selects an action. Step 2: The environment 
responds. Step 3: The agent updates itself. 


The goal is for the agent to learn how to make better and better choices
from the list of actions. That is, we want it to learn from experience
how to perform increasingly well.


The whole process begins when we put the environment into an initial
state. In a board game, this is the setup for the start of a new game. 
A full training cycle (such as a game from start to finish) is called an 
episode. We generally expect to teach the agent over a great many 
episodes. 


Let’s look at each of these three steps in order. 


26.3.1 Step 1: The Agent Selects an Action 


We begin with Step 1 in Figure 26.4, repeated here as Figure 26.5. 


[Diagram: the environment passes the new state and the available actions to the agent, which uses its private information to choose an action.]


Figure 26.5: Step 1 of our reinforcement learning diagram in Figure 26.4. 
The environment provides the agent with the current world state and a 
choice of actions. The agent chooses an action, which is implemented by 
the environment. 






The environment is the world in which all of our agent’s actions take
place. The environment is completely described by a set of numbers
that are collectively called the environmental state, the state
variables, or simply the state. This might be a short list, or a very long
one, depending on the complexity of the environment. In the case of
a board game, the state would usually be the position of all
markers on the board, plus any game assets (such as game money,
power-ups, hidden cards, etc.) held by each player.


We also have an agent, who is able to look at the environment and
take an action that can affect it. We often anthropomorphize the agent 
and talk about how the agent “wants” to achieve some result. In basic 
reinforcement learning, the agent is idle until the environment gives 
the agent feedback and tells the agent that it’s time to take an action. 


The agent chooses an action from the list using an algorithm called 
its policy, along with whatever private information the agent may 
wish to maintain. 


We usually think of the agent’s private information as a database. It 
might contain descriptions of possible strategies, or some kind of his- 
tory of the actions that were taken in particular states, and the rewards 
that were returned. 


The policy, by contrast, is an algorithm that is usually controlled by 
a set of parameters. The parameters will usually change over time as 
the agent plays and searches for improved action-choosing policies. 
The algorithm of the policy might also change over time, but that’s less 
common. 


We could think of the policy’s parameters as part of the private infor- 
mation if we like. But it’s usually helpful to conceptually distinguish 
the information held by the agent into the database of retained infor- 
mation (the private information), and the algorithm and parameters 
that guide its choice of action (lumped together as the policy). The 
overall goal is for the agent to use the private information to improve 
the parameters of the policy over time, learning from successes and 
failures. 


We usually don’t think of the agent implementing its action. Instead, 
the chosen action is reported to the environment, and the environment 
takes care of performing the action. This is because the environment 
could change as a result of the action even as it’s being executed. 


When the environment implements the agent’s chosen action, the envi- 
ronment itself usually changes as a result. This can produce a cascade 
of further activity as the environment responds to its own changes, and
then changes in response to those changes, and so on. Returning to our
bipedal robot, our controller (the agent) might direct the robot to put 
its right foot forward. As the robot (the environment) is doing this, it 
might find that it needs to duck its head forward so its center of gravity 
stays over its feet, but that could cause the head to pivot upwards so 
it can see where it’s going, which might cause it to notice an obstacle, 
and therefore put out a hand, causing more motion of its other body 
parts so it doesn’t fall over, and so on. The agent might wait until the 
robot has become stable, and then evaluate the situation and direct 
the robot to take another action. 


In our tic-tac-toe game, the environment’s changes would involve 
placing an X in the agent’s chosen cell, and then making a new move 
by placing an O in an empty cell. 


When the environment is done changing, its updated configuration is 
saved in the state variables. 


26.3.2 Step 2: The Environment Responds 


Let’s continue to Step 2 of Figure 26.4, repeated here as Figure 26.6.


Figure 26.6: Step 2 of our reinforcement learning diagram in Figure 26.4.
In this step, the environment saves its new state, prepares actions for the
agent, sends the new state to the agent, and gives the agent feedback.
The new state is saved in the state variables at the far left.


In this step, the environment updates itself, and prepares information 
that is communicated to the agent to provide feedback on its action. 


The environment saves its new state in the state variables, so that they 
will reflect the new environment when the agent next gets to choose an 
action. The new state variables replace the old. 


The environment also uses its new state to determine what actions the
agent might choose on its next move. This new list of actions replaces
the old list.


The environment also provides a reward signal that tells the agent 
how “good” its last chosen action was. The meaning of “good” is com- 
pletely dependent on what this whole system is doing. In a game, good 
actions are moves that lead to stronger positions or even victory. In 
an elevator scheduling system, good actions might be those that min- 
imize wait times. For a robot, good actions might keep it upright and 
moving forward. 




Chapter 26: Reinforcement Learning 


Let’s look inside the agent for a moment. There are two update areas, 
one each for the private information and the policy. Each update area 
is responsible for using the environmental feedback, along with the 
current configuration of the private information or policy, to update 
each of those mechanisms. 


In this way of structuring the process, the update areas gather up the 
information they need during this step, and then write updated ver- 
sions in the next. 


26.3.3 Step 3: The Agent Updates Itself 


Step 3 of Figure 26.4 is repeated here as Figure 26.7. 







Figure 26.7: Step 3 of our reinforcement learning diagram in Figure 26.4. 
Here the new, updated versions of the private information and the policy 
replace the old versions. 


In this step, the agent saves its updated private information and policy 
parameters. 


After Step 3, the agent might wait quietly until the environment tells it 
that it’s time to take action again. Alternatively, it could actively plan 
for its next move, contingent on what it learns from the environment 
when its last action has been completely processed. This is particu- 
larly useful for real-time systems which need to react quickly when the 
environment asks for an action.


When we're playing a game, at some point the game usually ends. This
is usually signaled to the agent with a final reward (or ultimate 
reward). If the agent is trying to win a game or achieve some other 
bounded task, this signal often also tells the agent that the game is 
over. This marks the end of that episode of training. 


The goal of reinforcement learning is to discover ways to help the agent 
in this scenario learn from the feedback to choose actions that bring it 
the best possible rewards. Whether it’s winning a game, growing crops, 
or moving a robot, we want to create an agent that can learn from 
experience to become as good as possible at manipulating its envi- 
ronment to bring about positive rewards. 
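The three steps above can be sketched as a short loop. This is only an illustration under stated assumptions: the class names, the countdown "game," and the random policy are all hypothetical stand-ins, chosen to show how state, available actions, reward, and the end-of-episode signal flow between a triggered agent and a triggered environment.

```python
import random


class CountdownEnvironment:
    """Hypothetical toy environment: the state is one integer to be driven to 0."""

    def reset(self):
        self.state = 5                      # the initial state starts a new episode
        return self.state, self.available_actions()

    def available_actions(self):
        # The list of legal actions depends on the current state.
        return [1, 2] if self.state >= 2 else [1]

    def step(self, action):
        # The environment, not the agent, carries out the chosen action.
        self.state -= action
        # Step 2: new state, new action list, and a reward go back to the agent.
        done = self.state == 0
        reward = 1.0 if done else 0.0       # only the final move earns a reward
        return self.state, self.available_actions(), reward, done


class RandomAgent:
    """Hypothetical agent: a trivial policy plus a history as private information."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.history = []                   # private information: (state, action, reward)

    def select_action(self, state, actions):
        # Step 1: the policy picks one of the offered actions.
        return self.rng.choice(actions)

    def update(self, state, action, reward):
        # Step 3: save the feedback. A learning agent would also adjust
        # its policy parameters here.
        self.history.append((state, action, reward))


def run_episode(env, agent):
    """Play one episode from the initial state to the final reward."""
    state, actions = env.reset()
    done = False
    total = 0.0
    while not done:
        action = agent.select_action(state, actions)
        next_state, actions, reward, done = env.step(action)
        agent.update(state, action, reward)
        state = next_state
        total += reward
    return total
```

Calling run_episode(CountdownEnvironment(), RandomAgent()) plays a single episode. Training would repeat this over a great many episodes, with an update step that actually changes the policy.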


26.3.4 Variations on the Simple Version


The description above represented a polite form of turn-taking between 
the agent and the environment. But things can be more interesting. 


We will distinguish two types of behavior. The first, free-running, 
refers to an object (either the agent or environment) that is contin- 
uously acting and changing, regardless of outside intervention. The 
second, triggered, refers to an object that sits idle and unchanging 
until an external force stirs it into action. 


Both of these behavior types can be applied to both the agent and the 
environment, giving us the four combinations in Figure 26.8.


Figure 26.8: We can think of both the agent and the environment as
having one of two conditions. A free-running object evolves and changes 
on its own, while a triggered object only acts when requested. The four 
examples are discussed in the text. 


In the upper-left, we have crop farming. The agent is the farmer, and 
the environment is the farm and weather. Both the environment and 
the agent are free-running. The environment is free-running because 
the crops, the fields, the weather, the insects, and more are all doing 
their own thing and changing even if the farmer does nothing. The 
farmer is free-running as well, since he can step in and work the fields 
any time he wants. 


In the upper-right, we’re painting a house. The agent is the house 
painter, and the environment is the outside of the house. The agent is 
free-running, since she can decide to paint anywhere and anytime she 
likes. The environment is triggered, because it only changes when the 
painter applies paint. 


In the lower-left, the behaviors are reversed. The agent is a chef pre- 
paring a meal, and the environment is an oven timer. The timer is 
free-running, because (once it’s started) it counts down whether the 
chef is doing anything or not. The chef is triggered, because she can’t 
take the meal out of the oven until the timer runs down.


In the lower-right we have a game of backgammon. The agent is one
player, and that agent’s environment is the other player, the board, 
and the dice. The agent is triggered, because he can only roll the dice 
and make a move when it’s his turn. The environment is also triggered, 
because when it’s the agent’s turn, the environment waits and doesn’t 
change until the agent is done. 


This last variation is the easiest to model in the computer, because the 
flow of data is easy to predict and control. But the other variations are 
also useful. 


In this chapter, we’ll continue to use simple board games for our exam- 
ples, but reinforcement learning is perfectly amenable to these and 
other situations. 


For much of the rest of the chapter, when the context is clear, we'll 
imagine that we are the agent. So we'll often speak of things like “our 
policy” and “our experience,” meaning the agent’s policy or experience. 


26.3.5 Back to the Big Picture 


There are many ways to formalize the structure of reinforcement
learning. We can generalize the steps of Figure 26.4 to include triggered and
free-running scenarios for both the agent and the environment.


In general, we can imagine that the agent is an abstract entity using 
actions to manipulate the environment from one state to another. We 
call such actions transitions, since they cause the environment to 
change, or undergo a transition, from one state (that is, one set of state 
variables) to another. In this view, our learning problem basically boils 
down to states, transitions, and rewards. This way of thinking of 
things, along with a bit of additional information, is called a Markov 
decision process, or MDP. A Markov process models a system’s 
behavior over time using a collection of states, transitions from one 
state to the next, and a distribution that gives us the probabilities for 
each transition to occur [Shalizi07]. From a starting state, we apply
transitions that are randomly chosen according to the distribution,
perhaps until we reach an ending state. Framing our goal of teaching
the agent as a Markov process offers a rich theoretical framework, but 
we won't be needing it here. 
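Although we won't build on the theory, a Markov process is easy to picture in code. The sketch below is only an illustration: the state names and the probabilities in TRANSITIONS are invented, and the simulate function is a hypothetical helper. It starts in one state and repeatedly applies a transition chosen at random according to the distribution, until it reaches the ending state.

```python
import random

# Hypothetical states and transition probabilities, for illustration only.
# Each state maps to a list of (next_state, probability) pairs.
TRANSITIONS = {
    "start":  [("middle", 0.7), ("start", 0.3)],
    "middle": [("end", 0.5), ("start", 0.2), ("middle", 0.3)],
}


def simulate(start="start", end="end", seed=0, max_steps=1000):
    """Follow randomly chosen transitions from start until we reach end."""
    rng = random.Random(seed)
    state = start
    path = [state]
    while state != end and len(path) <= max_steps:
        next_states, probabilities = zip(*TRANSITIONS[state])
        # Pick the next state according to the transition distribution.
        state = rng.choices(next_states, weights=probabilities, k=1)[0]
        path.append(state)
    return path
```

Different seeds give different paths through the same states, which is the kind of unpredictability the agent has to cope with.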


When the agent updates its policy, it might have access to all the 
parameters of the state, or only some of them. If an agent gets to see 
the entire state, we say it has full observability, otherwise it has only 
limited observability (or partial observability). One reason we 
might give an agent only limited observability is that some parameters 
might be very expensive to compute, but we’re not sure if they’re rele- 
vant or not. So we block the agent’s access to those parameters to see if 
it hurts the agent’s performance. If leaving them out does no harm, we 
can leave them off entirely from then on and save effort. 


As soon as we start thinking about using feedback to train agents in 
the way we’ve been discussing, we find ourselves facing two interest- 
ing problems. 


First, the credit assignment problem challenges us to find a way 
to use the ultimate reward (such as winning or losing a game) to mod- 
ify how we think about every move we made along the way. 


For instance, suppose we're playing a game and make a winning move. 
That move will get great feedback, which we can remember and asso- 
ciate with that move, but we’d like to somehow share that credit by 
assigning some of it to all the moves along the way that got us to that 
victory. That way if we see those intermediate boards and moves again, 
we'll be more likely to select the move that led to winning. 


By the same token, if we lose, we’d want to let the moves that led us 
there take some of the blame, so we'll be less likely to select them again. 


Second, the explore or exploit dilemma asks if we want to play 
it safe and choose moves that we know are likely to turn out well, or 
take risks and try new moves that might turn out to be better, but also 
might be terrible.


For instance, when playing a game, we might feel sure that taking
a certain move will put us in a good position. But maybe a different 
move that we haven’t tried yet would put us in an even stronger posi- 
tion. The only way to find out if the new move would be better is to try 
it and find out. 


On the one hand, we want our agent to explore every possibility it can, 
so that it can discover the best choices in every situation. On the other 
hand, training time is almost always too short to try out every choice. 
It might be better to exploit the moves we already know are good, and 
build on that knowledge to find paths that can lead to victory, even if 
they’re not the fastest or even the most reliable paths. 


We need to somehow decide, when picking every move, whether we 
want to play it safe with a move we've tried before, or take a risk with a 
new move. 


26.3.6 Saving Experience 


Let’s suppose that we’re the agent playing a game. We'll get a reward 
signal at each step, up to and including the final move. We can remem- 
ber those rewards in our private information. 


Let’s adopt that as a general plan: we'll save every reward for every 
move in every game in our local storage. We'll keep that local storage 
around from one game, or episode, to the next. So our local storage will 
contain the history of every game we've ever played, and the reward 
we received for every move in every game. 


Along with each move’s reward we'll save a score, which is our overall 
estimate for how attractive that move is. In other words, if we see that 
board again, we'll know that the move with the highest score is the 
one that has turned out the best out of all the moves we've tried in that 
situation.


As we play new games and have to choose moves, we can always
consult our history to see what we’ve learned from previous times we’ve
seen this board. Moves that we haven’t tried will be marked as such. 
Moves that we have tried will have a score associated with them that 
tells us our combined experience with that move. A high score means 
that move led to better rewards than a move with a lower score. 


When it’s time to pick a move, we can either play it safe and select the 
move with the highest score, or take a risk and try a new move. 
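One simple way to make that choice can be sketched as follows. This is an assumption for illustration, not a method the text has committed to: the function name choose_move, the dictionary of scores, and the explore_probability parameter are all hypothetical. With some fixed probability we try a move we haven't tried on this board, and otherwise we pick the tried move with the highest score.

```python
import random


def choose_move(scores, legal_moves, explore_probability, rng):
    """Pick a move for one board.

    scores maps each already-tried move to its saved score.
    legal_moves is assumed to be non-empty.
    """
    tried = [m for m in legal_moves if m in scores]
    untried = [m for m in legal_moves if m not in scores]

    # Take a risk: try something new, if there is anything new to try.
    # We are forced to do this when nothing has been tried yet.
    if untried and (not tried or rng.random() < explore_probability):
        return rng.choice(untried)

    # Play it safe: the tried move with the highest score.
    return max(tried, key=lambda m: scores[m])
```

For example, choose_move({0: 0.2, 3: 0.9}, [0, 3, 5], 0.1, random.Random(0)) usually returns the best-known move 3 and occasionally tries the untested move 5. Raising explore_probability shifts the balance toward exploring.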


26.3.7 Rewards 


In reinforcement learning, we’re trying to find a policy that leads the 
agent to pick the actions that deliver the highest rewards. So under- 
standing the nature of rewards, and how to use them wisely, is time 
well spent. Let’s dig in. 


We can distinguish two categories of rewards: immediate and long- 
term. Immediate rewards are the ones that are delivered by the 
environment back to the agent right after executing an action. Long- 
term rewards refer to our overall objective, like winning a game. 


We’d like to understand each immediate reward in the context of all the 
other rewards we get during a given game, or episode. There are lots 
of ways to interpret rewards and what they should mean to us. We'll 
look at one popular approach called the discounted future reward, 
or DFR. This is a way to represent the score we discussed at the end of 
the previous section. 


To see how the DFR works, we need to unwind the reward process a 
little bit. Let’s imagine that we’re an agent playing a game. 


When the game is done, we can line up the rewards we've collected for 
that game in a list, one after the other in the order they were received, 
along with the moves that earned those rewards. Adding up all the 
rewards gets us the total reward for that game, as in Figure 26.9.


Figure 26.9: The total reward associated with any episode is the sum of 
all the rewards that arrive from the first to the last move of the episode. 


We can add up any piece of this list, say just the first 5 entries, or the 
last 8. In particular, suppose that we pick a specific step in the middle 
of the game, and add up all the rewards from there up to the game’s 
end, as in Figure 26.10. 


Figure 26.10: The total future reward for any move is the sum of the
reward for that move and the rewards for all the moves that follow it, to
the end of the episode. For clarity, here we're just showing the moves from
move 3 onwards. We also show the rewards from move 5 to the end of the game.




Figure 26.10 shows us the total future reward, or TFR, associated 
with the 5th move of the game. It’s that part of the total reward that 
comes from the 5th move and all the moves that followed it. In other 
words, it’s the sum of the rewards from that move’s future, rather than 
its past. 


The very first move of a game is special, because its total future reward 
is the same as the game’s total reward. Since our rewards are always 
zero or positive, each subsequent move’s TFR will be equal to or less 
than the TFR of the move before it. 
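As a quick check of these ideas, here is a small sketch that computes the total reward and the total future reward from a list of rewards. The reward values are made up for illustration, and the function names are hypothetical. Moves are counted from 0 in the code, so the 5th move of the game has index 4.

```python
def total_reward(rewards):
    """Sum of every reward in the episode."""
    return sum(rewards)


def total_future_reward(rewards, move_index):
    """Sum of the rewards from move_index (counting from 0) to the end."""
    return sum(rewards[move_index:])


# Hypothetical rewards for a 9-move game, one per move, all zero or positive.
rewards = [0, 1, 0, 2, 3, 0, 1, 4, 2]

# TFR for every move in the game, in order.
tfrs = [total_future_reward(rewards, i) for i in range(len(rewards))]
```

The first entry of tfrs equals total_reward(rewards), and because no reward is negative, the values in tfrs never increase from one move to the next.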


The total future reward is a good description of what happened in the
game we just finished, but it’s not as good at describing what might
happen in future games, even if they start with the exact same sequence of
moves.


This is because real environments are unpredictable. 


If we’re playing a multi-player game, we can’t be sure that the other 
player (or players) will act the same way in the next game as they did 
in the previous game. If they make a different move, then that can 
change the trajectory of the game, and thus also change the rewards 
we'd earn. It could even change whether we win or lose. Even if we’re 
playing solitaire, we might be playing with a shuffled deck of cards, or 
a computer game with “random” numbers, so we can’t be sure what’s 
going to come our way in the future, even if we play the exact same 
way we did in the past. 


Immediate rewards are more reliable. We can think of two types of
immediate rewards. The first tells us the quality of the move we just
made, before the environment responds. This reward will be
completely predictable. If we face the identical environment again later,
and make the same move, we will get the same reward.


The second kind of reward tells us the quality of the move we just made, 
after the environment responds, so the reward can be influenced by 
the environment’s response. This type of reward isn’t as predictable 
as the kind given before the response, because the environment might
respond in a different way each time we make the move. For instance,
consider using a remote control to turn on a device. We might pick up 
the remote control, press the power button, turn on the device, and 
put the remote back down, in the very same way 100 times in a row. 
But unknown to us, the battery is draining, so the 101st time we try, 
the device won’t turn on. 


If the reward was given for pressing the button, that is, before the
environment responds, we'd get full marks. If the reward was for turning
on the device, that is, after the environment responds, that 101st time
we'd get a low score, perhaps even 0.


Although the environment can be unpredictable, it’s usually important 
to take it into account as much as we can. Generally speaking, we take 
an action not simply to perform that action, but to bring about a result. 
So waiting to see that result, even if we can’t be certain about what will 
happen, is a big part of understanding if our action represents a good 
choice. 


We say that real environments, with their unpredictable elements, are 
stochastic. By contrast, a perfectly predictable environment (such as 
a game based purely on logic) is deterministic. 


The amount of unpredictability (or stochasticity) can vary.
If the unpredictability is low (that is, the environment is largely
deterministic), then we might feel pretty confident about saying that the
rewards we just received are likely to be repeated, or very nearly so,
in future games. With very high unpredictability (that is, in a largely
stochastic environment), we have to assume that if we repeat the
same actions, any predictions we make about future rewards are likely
to be wrong.


We need to somehow account for the unexpected in a principled
way.


One way to do that is to scale our value of the total future reward
for any move by how confident we are that the game will proceed in
just the same way again. The less certain we are, the lower this
modified TFR will go. That way we'll generally have high scores attached to
moves we feel confident about, and low scores for the others.


We quantify our estimate of the stochasticity, or uncertainty, of the
environment with a discount factor. This is a number between
0 and 1, usually written with the lower-case Greek letter γ (gamma).
The value of gamma that we select represents our confidence in the
repeatability of the environment. If we think the environment is close
to being deterministic, and we'll get about the same reward for a given
action every time, we'll set gamma to a value near 1. If we think the
environment is chaotic and unpredictable, we’d set gamma to a value
near 0.


We can use the discount factor to create a version of the total future 
reward called the discounted future reward, or DFR. Rather than 
adding up all the rewards that come after an action, as we do for the 
TFR, we start with the immediate reward, and then we reduce the val- 
ues of the subsequent rewards by multiplying them by gamma one 
time for each step they are in our future. So the reward one step in 
the future is multiplied by gamma once, the reward after that is multi- 
plied by gamma 2 times, and so on. This accounts for the fact that we 
consider them increasingly less reliable. The technique is illustrated 
graphically in Figure 26.11.


Figure 26.11: The discounted future reward or DFR is found by adding
together the immediate reward, the next reward after multiplying it by 
gamma, the reward after that multiplied by gamma twice, and so on. 
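The recipe in Figure 26.11 translates almost directly into code. The sketch below is one possible way to write it. The function name and its arguments are hypothetical, and moves are counted from 0. Each reward is multiplied by gamma once for every step it lies in the future, and the results are added up.

```python
def discounted_future_reward(rewards, move_index, gamma):
    """DFR for the move at move_index (counting from 0).

    rewards is the list of immediate rewards for one episode, and
    gamma is the discount factor, a number between 0 and 1.
    """
    dfr = 0.0
    for steps_ahead, reward in enumerate(rewards[move_index:]):
        # steps_ahead is 0 for the immediate reward, 1 for the next, and so on,
        # so the immediate reward is not discounted at all.
        dfr += (gamma ** steps_ahead) * reward
    return dfr
```

As a small worked case, with rewards of 1, 2, and 4 and a gamma of 0.5, the DFR of the first move is 1 + (0.5 × 2) + (0.5 × 0.5 × 4) = 3.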


Notice that in Figure 26.11 each successive value gets multiplied by 
gamma one more time. These increased multiplications can have a 
significant effect. 


Let’s see this in action. We'll consider the reward and the discounted 
future reward we'd get from our opening move in a game, using sev- 
eral values for gamma. Figure 26.12 shows a set of immediate rewards 
for an imaginary game with 10 moves.


Figure 26.12: Immediate rewards for a game with 10 moves. The game 
ended without a clear winner. 


Applying different future discounts to these rewards following Figure
26.11 gives us the curves of Figure 26.13. Notice how quickly the
rewards drop to 0 as the discount factor decreases, meaning that we’re
less sure of our predictions of the future.


Figure 26.13: The rewards of Figure 26.12 as discounted by different values
of gamma. When gamma is 1, we are saying that the game will always play 
out exactly the same way every time. That is, the system is deterministic. 
As gamma decreases, we have less faith in future values, so their contri- 
bution drops quickly. When gamma reaches 0, we believe the system is 
completely stochastic, and we have no confidence in any reward beyond 
the immediate one. 


Adding up the values of each curve in Figure 26.13 gives us the dis- 
counted future reward for the first move for different values of gamma. 
These DFRs are shown in Figure 26.14. Notice that as we think of 
the future as being increasingly unpredictable (that is, gamma gets 
smaller), the DFR also becomes smaller, because we're less confident 
of getting those future rewards.


Figure 26.14: The discounted future reward (DFR) from Figure 26.13 for
different values of gamma. Notice that gamma runs from 1 at the left to 0 
at the right. As gamma decreases and we have less belief that the system 
is deterministic and thus repeatable, we discount the rewards from the 
future by an increasing amount. Finally, when gamma is 0, we ignore all 
future rewards and the DFR is just the value of the immediate reward. 


When gamma has a value near 1, the future rewards aren’t diminished 
much, so the DFR is close to the TFR. In other words, we're saying that 
the total rewards we got for making this move are likely to be similar 
to the total rewards we'll get if we play this move again. 


But when gamma has a value near 0, then the future rewards are scaled
way down to the point where they practically don’t matter, and we’re
left with just the immediate reward. In other words, we’re saying that
we have little confidence that the game will continue again as it did this
time, so the only reward we can be sure of is the immediate reward.
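We can watch this behavior by computing the DFR of an opening move for several values of gamma. The rewards below are invented for illustration (they are not the values plotted in Figure 26.12), and the function name is hypothetical.

```python
def discounted_future_reward(rewards, gamma):
    """DFR of the first move: each reward is scaled by gamma once per step ahead."""
    return sum((gamma ** steps_ahead) * reward
               for steps_ahead, reward in enumerate(rewards))


# Hypothetical immediate rewards for a 10-move game.
rewards = [2, 1, 3, 0, 4, 1, 2, 3, 1, 2]

gammas = [1.0, 0.9, 0.8, 0.5, 0.0]
dfrs = [discounted_future_reward(rewards, gamma) for gamma in gammas]

for gamma, dfr in zip(gammas, dfrs):
    print(f"gamma = {gamma:.1f}   DFR = {dfr:.3f}")
```

With gamma at 1 the DFR equals the total future reward (the plain sum of the list), with gamma at 0 it equals the immediate reward alone, and the values in between shrink steadily as gamma shrinks.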


In many reinforcement learning scenarios, we pick a value of
gamma around 0.8 or 0.9 to get started, and then adjust the value as
we discover more about how stochastic our system is, and how well
our agent is learning.


26.4 Flippers


In the following sections we’re going to look at actual algorithms for 
learning a game. To keep our focus on the algorithms and not the game, 
we'll pare down tic-tac-toe into a new solitaire game which we'll call 
Flippers. 


We play Flippers on a square grid (we'll start with a 3 by 3 grid). Each 
cell holds a little tile that pivots around a bar, as in Figure 26.15. 





Figure 26.15: The board for the game of Flippers. Each tile is blank on 
one side and has a dot on the other. A move in the game consists of flip- 
ping (or rotating) one tile. 


One side of each tile is blank, while the other side holds a dot. On each 
move, the player pushes one tile to flip it over. So if it was showing a 
dot, the dot then disappears, and vice-versa. 


The game begins with the tiles in a random state. Victory comes from 
having exactly three blue dots showing, arranged in either a vertical 
column or horizontal row. All the other tiles are showing blanks. 


This may not be the most intellectually demanding game ever invented.


Starting from our random board, we want to get to any of these goal
boards in the smallest number of flips. Note that there are six differ- 
ent boards that can satisfy the goal (three horizontal rows and three 
vertical columns). Getting the board into any of these states counts as 
victory (diagonals don’t count as a win). 
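The rules of Flippers are small enough to capture in a few lines. The sketch below is one possible representation, and everything about it is an assumption made for illustration: the board is a tuple of nine true/false values (true means a dot is showing), the cells are numbered 0 to 8 from left to right and top to bottom, and the function names are hypothetical.

```python
import itertools

# The six winning arrangements: three rows and three columns.
WINNING_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),   # horizontal rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),   # vertical columns
]


def flip(board, cell):
    """Return a new board with the tile at cell flipped over."""
    new_board = list(board)
    new_board[cell] = not new_board[cell]
    return tuple(new_board)


def is_win(board):
    """True if exactly three dots show, forming one full row or column."""
    dots = {cell for cell, showing in enumerate(board) if showing}
    return any(dots == set(line) for line in WINNING_LINES)


def count_winning_boards():
    """Count the winning boards among every possible pattern of dots."""
    all_boards = itertools.product([False, True], repeat=9)
    return sum(is_win(board) for board in all_boards)
```

Because is_win compares the whole set of dots against each line, a row of dots with any extra dot elsewhere does not count as a win, and neither does a diagonal.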


Figure 26.16 shows an example game, along with our notation for 
showing the moves. We read the game left to right. Each board but the 
last will show the starting configuration for that move, with one cell 
highlighted in red. That’s the cell that is going to be flipped over. 


Figure 26.16: Playing a game of Flippers. (a) The initial board, showing
three dots. The red square shows the tile we intend to flip for this move.
(b) The resulting board is like part (a), but the tile in the upper-right has
gone from blank to dot. Our move for this board is to flip the center tile.
(c) through (e) show subsequent steps in game play. Board (e) is a winning
board.


Now that we have a game to play, we can look at how to use reinforce- 
ment learning to win at it. 




26.5 L-learning 


Let’s use what we’ve seen so far to build a complete system for learn- 
ing how to play Flippers. Although we'll make this algorithm much 
better in the next section, this starting version is going to perform so 
badly we'll call it L-learning, where L stands for “lousy.” Note that 
L-learning is a stepping-stone that we invented to help us get to some- 
thing better, and not a practical algorithm that appears in the literature. 
It is, after all, lousy. 


To make things easy, we’re going to use a very simple reward system. 
Every move we make in Flippers gets an immediate reward of 0, except 
for the final move that wins the game. We know that every game can 
be won because Flippers is a solitaire game, so there are no surprises. 
To prove that every game can be won, we can take any starting board 
and flip over all the tiles that are showing a dot, so that there are no 
dots showing. Then we can flip over three tiles in any row or column, 
and we’ve won. 


Our challenge is not just to win, but to win in the smallest number of 
moves. 


The final, winning move gets a reward that depends on the length of the 
game. If it took 1 move to win the game, the reward is 1. If it took more 
moves, this final reward drops off quickly with the number of moves 
that were required. The specific formula for this curve is less import- 
ant than the fact that it drops off fast and is always getting smaller. A 
graph of our final reward versus game length curve is shown in Figure 
26.17.


Figure 26.17: The reward for victory in Flippers starts at 1 for an
immediate win, but drops quickly with the number of moves required to win
the game.


At the heart of our system will be a grid of numbers that we’ll call the 
L-table (again, L is for “lousy”). 


Each row of the L-table represents one state of the board. That is, every 
entry in that row has the same arrangement of tiles showing blanks or 
dots. Each column will represent one of the nine actions we can make 
in response to that board. So each column represents which tile we 
choose to flip over on that move. 


The contents of each cell in the table will be a single number, which 
we'll call an L-value. Figure 26.18 shows this schematically. 


Figure 26.18: The L-table contains one row for each of the 512 possible
patterns of blanks and dots on a Flippers board, and one column for each
of the 9 possible actions. Each entry is called an L-value.


This table is big, but not too big. There are 512 rows, and 9 columns,
for a total of 512 × 9 = 4608 cells.


We're going to use the L-table to help us choose the highest-rewarding 
action in response to each board. To make that happen, we’re going to 
fill each entry with a number, based on experience, that tells us how 
good the corresponding move is. This value is the score we referred to 
in the previous section, telling us how good our move was. 


There are two steps to using the L-table: filling it in, and using it to 
play. 


Before we start assigning values to the L-table, we'll initialize it with a
0 in every cell.
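

As a minimal sketch of the table itself, suppose we number each board by reading its nine tiles as the bits of a binary number. This numbering is only an assumption (any fixed, consistent scheme would do), and the names board_to_row and l_table are made up for illustration.

```python
NUM_TILES = 9                  # a 3-by-3 Flippers board
NUM_BOARDS = 2 ** NUM_TILES    # 512 patterns of blanks and dots

def board_to_row(board):
    """Map a board (9 values, 1 for a dot, 0 for a blank) to a row number.

    Assumed scheme: tile i contributes bit i of the row number.
    """
    return sum(int(tile) << i for i, tile in enumerate(board))

# One row per board, one column per action, every cell starting at 0.
l_table = [[0.0] * NUM_TILES for _ in range(NUM_BOARDS)]
```

With this scheme the empty board lands in row 0 and the board with all nine dots showing lands in row 511.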


As we play a game, we'll keep a record of all the moves we've played. 
When the game is over, we'll look back through our actions and 
rewards for the whole game, and determine a value to assign to each 
move. Then we'll combine this value with the number already in the 
corresponding cell to produce a new value for that move (we'll get to 
the mechanics for this in a moment). This is called the update rule. 


As we play a game (either during the learning phase, or later for
real), we’ll pick an action by looking at the corresponding row for
the board at the start of that move. We'll use a policy to tell us
which of the actions in that row we want to select.


Let’s make these steps concrete. 


First, after each game (or episode), we need to determine the score
we want to assign to each action we played. Let’s use the total future
reward, or TFR, which we discussed in the last section. Recall that the
TFR comes from lining up all our actions and their rewards, and then
adding each action's own reward to all the rewards that came after it.


While playing the game, every move along the way got an immedi- 
ate reward of 0, but the final move got a positive reward based on the 
game’s length. This means that the TFR for each action we took along 
the way will be the same as this final reward. So in a short game, this 
shared TFR will be larger than if the game was longer. 
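

The TFR calculation can be sketched in a few lines. This is one possible way to write it, not a definitive implementation, and the function name total_future_rewards is hypothetical. It walks the list of rewards backward, keeping a running sum.

```python
def total_future_rewards(rewards):
    """Return the TFR for each move: its own reward plus all later rewards."""
    tfrs = []
    running = 0.0
    for reward in reversed(rewards):
        running += reward
        tfrs.append(running)
    tfrs.reverse()
    return tfrs

# A 5-move game: every reward is 0 until the winning move.
game_tfrs = total_future_rewards([0, 0, 0, 0, 0.5])
```

Because only the last reward is non-zero, every entry of game_tfrs equals that final reward.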


Second, let’s pick a really simple update rule that says after each game, 
the TFR we compute for each cell replaces whatever was in there before. 
In other words, the TFR for each action in this game becomes the new 
value in the cell corresponding to that action and the board we were 
looking at when we took that action. 


This simple update rule is good for getting familiar with how the 
L-learning system works. But because it doesn’t combine our new 
experience with what we’ve learned before, this rule is a big reason 
that this algorithm isn’t going to perform well. 


We need a policy that tells us which move to play in response to a given 
configuration of the board. Let’s choose the action corresponding to 
the largest L-value in the row. If there are multiple cells with the same 
maximum value, we'll pick one of them at random. Figure 26.19 shows 
this graphically. 


Figure 26.19: The policy step involves choosing one action in response to
a board. Here we see a piece of the L-table. The row is the one that
corresponds to the board state shown at the far left. Each column holds
the most recently computed TFR that resulted when that action was taken
at that board state. In L-learning, we choose the largest value. Here that
means we'll flip the center-right tile.
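

This policy is short enough to sketch directly. The function name greedy_action is made up for illustration, and we assume a row is simply a list of 9 numbers.

```python
import random

def greedy_action(row, rng=random):
    """Pick the column with the largest value, breaking ties at random."""
    best = max(row)
    candidates = [a for a, value in enumerate(row) if value == best]
    return rng.choice(candidates)
```

On a fresh table every value in the row is 0, so all nine actions tie and the choice is purely random.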


Let’s now assemble the pieces above into a learning algorithm. 


We begin by building the L-table. As we mentioned, it’s 512 rows by 9 
columns. We initialize every cell with a 0. Now we're ready to learn. 


Let’s imagine that we are the agent. The first game, or episode, begins 
as we are presented with the first board. Let’s assume it’s a board that’s 
randomly populated with dots (there might be none at all, or nine, or 
any number in between). 


At the start of this first game, as at the start of every game, we'll create 
a list in our private information store that will remember our moves 
as we play them. Each move will be represented by 4 items (we won’t 
be using them all right away). These items are the board state we were 
presented with (the row number in the table), the action we took (the 
column in that row), the reward we received for that action (always 0
except for the final action), and the state that resulted from our action 
(again, the corresponding row number in the table). Figure 26.20 
shows the idea. At the start of the game, the list is empty. 


Figure 26.20: Each time we make a move, we append a bundle of 4 values
to the end of a growing list. The bundle contains the starting state, our
chosen action, the reward we received, and the final state the
environment returned to us after taking that action. Everything can be
represented by a number if we use a fixed scheme for assigning a number
(that is, a row of the L-table) to each configuration of dots on the
board, and a number to each of the 9 cells. The way we assign these
numbers isn’t important as long as we're consistent.
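

One way to hold such a bundle is a small named tuple. The class name Move and its field names are hypothetical. The example numbers assume that tile i corresponds to bit i of the state number, which is just one consistent scheme among many.

```python
from typing import NamedTuple

class Move(NamedTuple):
    state: int       # row of the table for the board we were shown
    action: int      # column: which of the 9 tiles we flipped
    reward: float    # immediate reward from the environment
    next_state: int  # row of the table for the resulting board

moves = []           # empty at the start of every game
moves.append(Move(state=5, action=1, reward=0.0, next_state=7))
```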


To make our first move, we look at the row of the L-table correspond- 
ing to our starting board, and the 9 numbers we find along that row. 
Since this is our first time through, all the values are 0. Our policy tells
us to pick one at random, so we'll pick a random column, and that’s 
the action we take. 


The environment flips that tile for us, either making a dot appear or 
disappear. The environment then reports back to us a reward, and the 
new board. We make a little bundle to represent this move: the board 
we started with, the action we just took, the reward we got back, and 
the new state that resulted. We stick that bundle onto the end of our 
list of moves. 
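

The environment's side of this exchange can be sketched too. We assume again that tile i is bit i of the state number, with tiles counted row by row, and we take the winning boards to be those where the dots fill exactly one row or one column and nothing else is showing. The names flip and is_win are made up, and the reward calculation is left out because the text doesn't give its formula.

```python
# Assumed numbering: tile i (0..8, reading the board row by row) is bit i.
ROWS = [0b000000111, 0b000111000, 0b111000000]
COLS = [0b001001001, 0b010010010, 0b100100100]
WINNING_STATES = set(ROWS + COLS)

def flip(state, action):
    """Return the board that results from flipping tile `action`."""
    return state ^ (1 << action)

def is_win(state):
    """True when the dots fill exactly one row or column and nothing else."""
    return state in WINNING_STATES
```

Flipping the same tile twice returns the original board, which is why the exclusive-or operation is a natural fit here.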


Because we're playing solitaire, the environment isn’t going to make 
any moves on its own. As soon as it’s sent us our feedback, the envi- 
ronment will tell us to take a new action. 


In response, we repeat the process from above. We'll look at the new
board, find its row in the L-table, find the cell with the largest
value in that row, and report that as our action. We'll get back a
reward and a new state, and we add a new bundle of the 4 items
describing this move to our list.




This goes on until the game is over. In that final piece of feedback, we
get our only non-zero reward. It’s the final reward based on the
number of moves we played in the game, which drops off quickly, as we
saw in Figure 26.17.


With that final, non-zero reward we know the game is over, so it’s time 
to learn from our experience. 


We start by looking at our bundles from our list of moves. We line up
our board states and resulting moves, along with their rewards, as in
Figure 26.21. One by one, we look at each move and find its TFR by
adding its own reward to all the rewards that came after it. Since we
know all the intermediate rewards are 0, every action’s TFR will be the
final reward, but let’s anticipate our improved algorithm to come and
go through the steps of finding the individual TFRs anyway.

















          move 1   move 2   move 3   move 4   final
reward    0        0        0        0        0.5
TFR       0.5      0.5      0.5      0.5      0.5


Figure 26.21: Finding the TFR for each move. We add up the immediate
reward for each move (shown directly underneath it) with the immediate
rewards for all following moves. In our game where every immediate
reward is 0 except for the final reward, these sums will all be the same.


We then use our simple update rule and plunk each action’s TFR into 
the cell of the L-table corresponding to that action for that board, as 
shown in Figure 26.22. 


Figure 26.22: Updating our L-table with the new TFR for each action we
took in the game. We find the row corresponding to the board we were
looking at when we took the action, and the column corresponding to the
action we made. The new TFR replaces whatever was in that cell before.
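

Here is a sketch of this overwrite-style update, under the same assumptions as before (states are numbered with tile i as bit i, and each move is a bundle of state, action, reward, and next state). The function name l_update is hypothetical. The sample game starts with one dot in the upper-left corner and wins in two moves by completing the top row, earning the 2-move reward of 0.7 used later in this section.

```python
def l_update(l_table, moves):
    """Overwrite each visited cell with that move's total future reward.

    `moves` is a list of (state, action, reward, next_state) tuples in
    the order they were played.
    """
    tfr = 0.0
    # Walk backward so a running sum gives each move's TFR.
    for state, action, reward, _next_state in reversed(moves):
        tfr += reward
        l_table[state][action] = tfr   # replaces the old value entirely

table = [[0.0] * 9 for _ in range(512)]
game = [(1, 1, 0.0, 3), (3, 2, 0.7, 7)]
l_update(table, game)
```

After the call, both cells touched by this game hold 0.7 and every other cell is still 0.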





If we want to learn some more, we go back up to the start of the process
where we made our empty list of moves, and play a new game. When
we're done, we compute a TFR value for each action we selected, and
store that in its corresponding cell (overwriting whatever was there
before). Note that we don’t reset the L-table after each game, though,
so with luck it will gradually fill up with TFRs as we play more episodes.


When it’s time to stop training and start playing, we use the L-table to 
pick our moves just as before. That is, at each move we're presented 
with a board, so we find that row of the table, pick the largest L-value 
in that row, and select the action corresponding to that score. 


Let’s see how well our system works. 


We'll start out by playing 3000 episodes of Flippers from start to
finish, so the L-table can get filled up pretty well.


Figure 26.23 shows a game of Flippers played from start to finish after 
these 3000 episodes of training. It’s not a very nice result. There’s a 
simple 2-move solution that any human would spot: flip the left-mid- 
dle cell, and then the upper-left cell (or do it in the other order). Instead, 
our algorithm seems to meander randomly until it finally stumbles on 
a solution after 6 moves. 


Figure 26.23: Playing a game of Flippers after 3000 episodes of training
with the L-table algorithm. Read the game left to right.


The arrangement shown in Figure 26.23 is a rotation of the L-table
in previous illustrations, to better fit the available area. Each vertical
column represents one board configuration (or state). The 9 possible
actions are shown in each row, highlighted in red. The thick black
outline shows the action that was selected from that list, leading to the
new board in the column to its right. The shaded cell shows the action
taken. If the move causes a dot to appear, the move is shown as a solid
red dot. If the move causes the dot to go away, it’s shown as an
outlined red dot. The green bar below each board shows its L-value from
the table. Larger bars correspond to larger L-values.


Boards near the right have larger L-values than those near the left. 
That’s because those boards were sometimes the randomly chosen 
starting board for a game. If we picked a good move and won immedi- 
ately, or in just a few moves, the final reward was large. 


Returning to this game, starting from the position on the far left, the 
algorithm’s first move was to flip the cell in the lower-left, introducing 
a new dot. From that result, it then flipped the square in the middle 
of the leftmost column, again introducing a dot. From that position 
it then flipped the upper-left square, removing the dot that was there. 
The game continued in this way until it found a solution. 


Figure 26.24 shows another easy starting position. This one already 
has three in a row in the center column. The algorithm only needs to 
flip the two dots in the right column. Instead, it takes 6 moves. 


Figure 26.24: Another easy game of Flippers, played after 3000 episodes
of training. The board needed only 2 moves to win, but the algorithm
used 6 moves.


We'd expect our algorithm to improve with more training, and it does.
Figure 26.25 shows the same game as Figure 26.23 after doubling the
length of the training run to 6000 episodes.


Figure 26.25: The same game as Figure 26.23 after 6000 episodes of
training. The algorithm has found the short, easy path to success.


This is very nice. The algorithm found the easy answer and went right 
for it. 


Figure 26.26 shows the same game as Figure 26.24. 


Figure 26.26: The same game as Figure 26.24 after 6000 episodes of
training. Again, the algorithm found the fastest solution and used it right
away.


We seem to have created a great algorithm for learning and playing. So 
why did we label everything with “L” for lousy? It seems to be working 
just fine. 


It is just fine, as long as the environment remains completely predict- 
able. Remember that earlier in this chapter we discussed unpredictable 
environments. In reality, most environments will be unpredictable. 


Logic-based solitaire games, such as the Flippers game we've been look- 
ing at, are one of the few activities that are completely deterministic. 


If our goal is to play only solitaire games in completely deterministic
environments, where we are able to execute every intended move
perfectly and the environment responds identically every time, then
this algorithm isn’t so lousy.


But such deterministic games are rare. For example, as soon as there’s 
a second player, there’s uncertainty, and the game becomes unpre- 
dictable. In any situation where the environment is not perfectly 
deterministic, the L-learning algorithm will flounder. 


Let’s see why, and then we'll see how to fix it. 


26.5.1 Handling Unpredictability 


Because we don’t have an opponent when playing Flippers on the com- 
puter, we have a completely deterministic system. Every time we make 
a move, we are guaranteed to get back the same result. 


But in the real world, even single-player activities can have unpredict- 
able events. Video games throw random surprises at us, a lawnmower 
can hit a rock and jump to the side, or an internet connection can stut- 
ter and we miss making the winning bid in an auction. 


Since handling unpredictability is so important, let’s introduce some 
artificial randomness into Flippers and see how our L-learning algo- 
rithm responds. 


Our model of randomness will take the form of a big truck that will 
drive by our playing area every now and then, shaking our board. 
Sometimes it’s enough to cause one or more random tiles to sponta- 
neously flip over. Of course, we still want to play good games and win, 
even in the face of such surprises. But our L-learning system is help- 
less in the face of this kind of event. 


It’s the combination of our policy and update rule that causes trouble.
Remember that before we start learning, each row starts out with all
0’s. When a training game is finally won, every action in that game
gets the same score, based on the length of the game, as we saw in
Figure 26.22. As we continue to play our training games, the next time
we come to that board we'll pick the cell with the largest value.


Suppose that we’re in the midst of a training game. We’re looking at a 
board that we once received as a starting board, and we won it in two 
moves. The L-table values for each of those moves have large scores. 
So we select the first high-scoring move, preparing to win on the next 
flip. 


But just after our first move, the big truck comes rumbling by, shak- 
ing our board and flipping a tile. Playing from this board forward, we 
end up requiring lots of moves before winning. So the TFR that will 
ultimately come from playing that action will be less than if the truck 
hadn’t come by. 


And here’s the problem: that smaller value will overwrite the previous 
value in every cell that led to this long game. In other words, because 
of that event, every action we played will see its L-value lowered. In 
particular, that great starting move that led to victory in just one more 
move will now have a low score. 


When we encounter this board again in a later game, we might find that 
one of the other cells has a larger value now than the cell we picked the 
last time, so we won't pick the great move. 


In other words, this one-time random event will cause us to stop mak- 
ing the best move we'd found up to that point. We will have “forgotten” 
that this was a great move, because a random event turned it into a 
bad move once. That low score made it unlikely that we’d ever choose 
that move again. 


Let’s see this problem in action. Figure 26.27 shows an example where
there are no unpredictable events. We start with a board with three
dots, and we find that the largest value in that row of the table is 0.6,
corresponding to a flip of the center square. So we make that move,
and supposing the next move is also well-chosen, we have a victory in
2 moves. The reward of 0.7 replaces the 0.6 that was there for our first
move, cementing this move’s status as the one to make. Everything
went right.


Figure 26.27: When there are no surprises, our algorithm works well. (a)
The row of the L-table for the starting board. The best move has a value
of 0.6, corresponding to a flip of the center tile. We will select and play
that move. (b) The game plays out and is won in just one more move. The
reward is 0.7. (c) The value of 0.7 overwrites the previous value for all
table entries that led to this success.


But now let’s introduce our rumbling truck in Figure 26.28. Just after 
we flip the center tile, the truck shakes the board and the bottom-right 
tile flips. This puts us on a whole new path. Let’s suppose the algorithm 
finally finds victory after four more moves. The total is five moves, and 
the reward of 0.44 is placed in every cell that led to this victory. 


Figure 26.28: (a) When a truck rumbles by, it flips the lower-right square,
causing the game to take 5 moves to win. (b) The new reward of 0.44
overwrites the old value of 0.6. This cell is no longer the highest-scoring
cell in the row.


This is terrible. In one quick stroke we have “forgotten” our best move. 
In this example, there are two other actions that now have better scores. 
The next time we come to this board, the cell with score 0.55 will be 
picked, which will not place us one move away from victory as before. 
In other words, our best move is now forgotten, and we’re going to 
always play a worse move. 
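

A few lines are enough to watch this forgetting happen. The values 0.6, 0.55, and 0.44 come from the example above. The position of the 0.55 cell and the other numbers in the row are made up for illustration.

```python
# One row of the L-table. Index 4 is the center tile.
row = [0.0, 0.55, 0.0, 0.0, 0.6, 0.0, 0.5, 0.0, 0.0]
CENTER = 4

best_before = row.index(max(row))   # the center tile, with 0.6
row[CENTER] = 0.44                  # the truck-lengthened game overwrites 0.6
best_after = row.index(max(row))    # now the cell holding 0.55
```

A single overwrite is all it takes to move the policy's choice from the center tile to a different one.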


Someday the truck might rumble by again and help us remember this 
cell, but that might take a very long time. Until that happens, we’ll 
make this inferior move on this board every time. And by the time the 
truck does come by and sets this move right again, others will have 
gone wrong. 


In other words, the L-table will almost always be inferior to what it 
ought to be, and thus on average our games will be longer and we'll get 
lower rewards. 


One surprise and we forgot how to play this board well. 
That’s why we called this algorithm “lousy.” 


But all is not lost. We looked at this algorithm because the lousy
version can be improved. Most of the algorithm is fine. We only need to
fix how it fails in the face of unpredictability.


From now on, we will always assume that when we play Flippers, that 
big truck might come thundering along, creating unpredictability in 
the form of occasionally flipping a random tile. 
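

The truck can be sketched as a small function that the environment calls after each of our moves. How often the truck passes isn't specified, so the flip_probability setting here is a made-up parameter, as is the name truck_shake. States are assumed to be numbered with tile i as bit i, and this sketch flips only one tile at a time.

```python
import random

def truck_shake(state, flip_probability=0.1, rng=random):
    """Occasionally flip one randomly chosen tile.

    `flip_probability` is a hypothetical setting, not a value from the
    text.
    """
    if rng.random() < flip_probability:
        tile = rng.randrange(9)
        return state ^ (1 << tile)
    return state
```

Setting flip_probability to 0 recovers the original, fully deterministic game.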


In the next section we'll see how to handle this kind of unpredictable 
event gracefully, and produce an improved learning algorithm that 
works well. 


26.6 Q-learning 


Without too much effort we can upgrade L-learning to a much more 
effective algorithm that is in common use today, called Q-learning 
(the Q is for “quality”) [Watkins89] [Eden15]. Q-learning looks a lot 
like L-learning, but naturally it instead fills up Q-tables with Q-values. 
The big improvement is that Q-learning performs well in stochastic 
environments. 


To get from L-learning to Q-learning we'll make three upgrades: how 
we compute new values for Q-table cells, how we update existing val- 
ues, and the policy we use for choosing an action. 


The Q-table algorithm starts with two important principles. First, 
we expect uncertainty in our results, so we build it in from the start. 
Second, we work out new Q-table values as we go, rather than waiting 
for the final reward. 


This second idea lets us work with games (or processes) that go on for 
a very long time, or perhaps never reach a conclusion. By updating as 
we go, we're able to develop our table of useful values even if we never 
get a final reward. 


To make this work we'll need to also upgrade L-learning’s super simple
reward policy from the last section. Rather than always returning a
reward of 0 except for the final move, we'll instead return immediate
rewards that estimate the quality of each action as soon as it’s taken.


26.6.1 Q-values and Updates 


Q-values are a sneaky way to approximate the Total Future Reward 
even when we don’t know how things are going to end up. 


To find a Q-value, we'll add together the immediate reward, plus all 
the other rewards that are yet to come. That’s just the definition of the 
Total Future Reward. But the way we'll estimate the rewards yet to 
come is to get them from the next state. 


Recall that in Figure 26.20 we said we'd retain four pieces of informa- 
tion for every move: the starting state, the action we chose, the reward 
we got, and the new state that action landed us in. We'll use that new 
state to get the rest of the future rewards. 


The key insight is to notice that our next move will begin with that new 
state, and by following our policy we'll always select the action whose 
cell has the greatest Q-value. If that cell’s Q-value is the Total Future 
Reward for that action, then adding together that cell’s value with our 
immediate reward gives us the current cell’s Total Future Reward. This 
works because our policy guarantees us that we'll always pick the cell 
with the largest Q-value for any given board state. 


Let’s make this more concrete with an analogy. Suppose we’re saving 
up money for a big purchase that we want to make at the end of the 
month, so we put some money in the bank every day. When we deposit 
ten dollars on the 11th day of the month, we’d like to know if we'll have 
enough money saved at the end of the month to afford our purchase. 
There’s no way to predict that right now. But suppose that we knew 
(somehow) that the total amount of money we'll deposit starting from 
tomorrow the 12th through the end of the month will be 200 dollars. 
Then we can add that 200 dollars to be deposited to the 10 dollars
we just saved, and know that we'll have 210 dollars at the end of the
month.


In the same way, the Q-table uses the Total Future Reward for the next 
state, which sums up all the rewards from that state to the end of the 
game, and we can add that to our immediate reward at this state to get 
the current Total Future Reward. 


If multiple cells in the next state share the maximum value, then it 
doesn’t matter which one we pick when we get there. All we care about 
now is the Total Future Reward that will come from the next action. 


Figure 26.29 shows this idea visually. Note that the value we compute 
in this step isn’t the final Q-value, but it’s almost there. 


Figure 26.29: An incomplete process for computing a new Q-value for a
cell. The new value is the sum of two others. The first value is the
immediate reward for taking the action that cell corresponds to, here
0.2. The second value is the largest Q-value of all the actions
belonging to the new state, here 0.6.


The step that’s missing is where Q-learning accounts for randomness. 


Rather than use the value of the next cell, we use the discounted value
of that cell. Recall that this means we multiply it by our discount
factor, a number from 0 to 1, often written as gamma (γ). As we discussed
earlier, the smaller the value of gamma, the less certain we are that
unpredictable events in the future won’t change this value. Figure
26.30 shows the idea.


Figure 26.30: To find the Q-value, we modify Figure 26.29 to include the
discount factor γ, which reduces the future rewards based on how
confident we are that they won't be changed by future, unpredictable
events.


Note that the many multiplications in the Discounted Future Reward 
shown in Figure 26.11 are automatically handled by this scheme. The 
first multiplication is included here explicitly. The multiplication for 
the state beyond that is accounted for when the Q-values in the cells 
for the next state are evaluated. 
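

The calculation so far fits in a single line of code. In this sketch the function name q_target is hypothetical, and the values in next_row are invented, except that its largest entry is the 0.6 from Figure 26.29.

```python
def q_target(reward, next_row, gamma):
    """Immediate reward plus the discounted best Q-value of the next state."""
    return reward + gamma * max(next_row)

next_row = [0.1, 0.6, 0.3, 0.0, 0.2, 0.0, 0.4, 0.0, 0.5]
undiscounted = q_target(0.2, next_row, gamma=1.0)   # the sum in Figure 26.29
discounted = q_target(0.2, next_row, gamma=0.9)     # with an example gamma
```

With gamma set to 1 we get the simple sum of 0.2 and 0.6. With gamma at 0.9 the future part shrinks to 0.54, so the result is smaller.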


Now that we've calculated a new value, how do we update the current 
value? We saw during L-learning that simply replacing the current 
value with the new one is a poor choice in the face of uncertainty. 
But we want to update the cell’s Q-value in some way, or we'll never 
improve. 


The Q-learning solution to this puzzle is to update the cell’s value
as a blend of the old and new values. The amount of blending is left up
to us as a parameter that we specify.


The blend is controlled by a single number between 0 and 1, usually
written as the lower-case Greek letter α (alpha). At the extreme value
of alpha=0, the value in the cell doesn’t change at all. At the other
extreme value of alpha=1, the new value replaces the old one. Values
of alpha between 0 and 1 blend, or mix, the two values, as shown in
Figure 26.31.


[Figure 26.31 image: horizontal axis labeled alpha, running from 0 to 1; vertical axis running from "old value" to "new value"]


Figure 26.31: The value of alpha (α) lets us blend smoothly from the
old value (when alpha=0) to the new value (when alpha=1), or
any value in between.
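

Assuming the blend is a straight-line mix, which is what the figure suggests, it can be sketched as below. The function name blend and the sample numbers are just for illustration.

```python
def blend(old_value, new_value, alpha):
    """Mix old and new: alpha=0 keeps the old value, alpha=1 takes the new."""
    return (1.0 - alpha) * old_value + alpha * new_value

kept = blend(0.6, 0.44, alpha=0.0)       # the old value survives untouched
replaced = blend(0.6, 0.44, alpha=1.0)   # the new value takes over
mixed = blend(0.6, 0.44, alpha=0.9)      # mostly new, with a little old
```

Setting alpha to 1 gives us back the overwrite rule from L-learning, which is exactly the behavior we're trying to escape.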


The parameter alpha is called the learning rate, and it’s left up to 
us to set it. It’s unfortunate that this is the same term that’s used by 
the update step of backpropagation, but usually context makes it clear 
which “learning rate” we're referring to. 


In practice, we usually set alpha to a value close to 1, such as 0.9 or even 
0.99. These values near 1 cause the new values to dominate the value 
stored in the cell. For instance, when alpha = 0.9, the new value stored 
in the cell is 10% of the old value, and 90% of the value we calculated 
above. But even a value of 0.99 is very different than 1, because remem- 
bering even 1% of the old value is often enough to make a difference. 


With this value set, we run our system through some training and see
how it does. Then we can adjust the value based on what we see, and
try again, repeating the process until we’ve found the value that seems
to work best. We usually automate this search so we don’t have to do it
ourselves.


The elephant in the room is that this whole argument has been based 
on having the correct Q-values in the next state, even before we get 
there. But where did they come from? And if we have the correct 
Q-values already, then why do any of this in the first place? 


These are fair questions, and we'll return to them after we look at the
new policy rule.


26.6.2 Q-Learning Policy 


Recall that the policy rule tells us which action to select when we’re 
given a state of the environment. We use this policy both while learn- 
ing and later, when playing actual games. 


The policy we used in L-learning is to always select the action with the 
highest L-value in the row of the table corresponding to the current 
board. That makes sense, since we’ve learned that this is the action 
that will bring us the highest rewards. 


But this policy ignores the explore or exploit dilemma, leaving us sol- 
idly in the “exploit” camp. If any one action managed to get a score 
higher than the others, then it could become the only one we select for 
a long time. In an unpredictable environment, the move that brought 
the best rewards sometimes may not bring us the best reward other 
times. And completely untried moves could be far better, if only we’d 
give them a chance. 


Still, we don’t want to pick moves at random, because we do want to 
favor the ones that we know will lead to high rewards. We just don’t 
want to do that every time. 


Q-learning picks a middle road. 


In Q-learning, instead of always picking the action with the highest
Q-value, we almost always pick the action with the highest Q-value.
The rest of the time we pick one of the other actions. Let’s look at two
popular policies for doing this.


The first approach we'll look at is called epsilon-greedy or
epsilon-soft (these refer to the Greek lower-case letter ε (epsilon),
so they sometimes appear as ε-greedy and ε-soft). The algorithms are
almost the same. We pick some number ε between 0 and 1, but usually
it’s a small number quite close to 0, such as 0.01 or less.


Each time we’re at a row and ready to choose an action, we ask the
system for a random number between 0 and 1, chosen from a uniform
distribution. If the random number is greater than epsilon, then we
proceed as usual and pick the action with the greatest Q-value in the
row. But in that occasional case when the random number is less than
epsilon, we select an action at random out of all the other actions in
the row. In this way we'll usually pick the most promising choice, but
infrequently we'll select one of the other actions and see where it leads
us. Figure 26.32 shows this idea graphically.


Figure 26.32: The epsilon-greedy policy says that each time we want to
pick an action from a row of the Q-table, we first pick a random number
from 0 to 1. If it’s greater than the value of epsilon (typically something
around 0.01), we pick the action with the largest Q-value in the row.
Otherwise, we pick one of the other actions at random.
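

Here is a sketch of the epsilon-greedy choice as just described. The function name epsilon_greedy is made up. One detail is our own assumption: when several cells tie for the largest value, this version simply treats the first of them as the best.

```python
import random

def epsilon_greedy(row, epsilon=0.01, rng=random):
    """Usually pick the largest Q-value; occasionally pick another action."""
    best = max(range(len(row)), key=lambda a: row[a])
    if rng.random() > epsilon:
        return best
    # The rare exploratory case: choose among all the other actions.
    others = [a for a in range(len(row)) if a != best]
    return rng.choice(others)
```

With epsilon at 0.01, roughly one move in a hundred is spent exploring and the rest follow the table.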


The other policy we'll look at is called softmax. This works in a way
similar to the softmax layer that we discussed in Chapter 17. In this
approach, we temporarily transform the Q-values in the row so that
they add up to 1. This lets us treat the resulting values as a discrete
probability distribution, and then we select one of the entries
according to those probabilities.


In this way we'll usually get the action with the largest score.
Infrequently, we'll get the action with the second-highest score. Even
less frequently, we'll get the action with the third-highest score, and
so on. Figure 26.33 illustrates the idea.


[Figure 26.33 image: a row of Actions labeled with selection probabilities, including 20%, 5%, 10%, 15%, 30%, 15%, and 5%]


Figure 26.33: The softmax policy for picking an action starts with
temporarily scaling all the Q-values in the row so that they add up to
1. Then we can pick a random action from this probability distribution,
so the probability of getting each action is given by its Q-value
relative to the sum of all the Q-values in the row.


An attractive quality of this scheme is that the probabilities of choosing 
each action always reflect the most current Q-values of all the actions 
associated with a given state. So as the values change over time, so too 
do the probabilities of picking the actions. 
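

A softmax policy might be sketched as follows. The text describes the transformation only as scaling the values so they add up to 1. This sketch uses the usual exponential form of softmax, which is an assumption about the details. The temperature parameter and both function names are also our own additions.

```python
import math
import random

def softmax_probabilities(row, temperature=1.0):
    """Turn a row of Q-values into probabilities that add up to 1."""
    biggest = max(row)   # subtracting the max avoids overflow in exp
    weights = [math.exp((q - biggest) / temperature) for q in row]
    total = sum(weights)
    return [w / total for w in weights]

def softmax_action(row, temperature=1.0, rng=random):
    """Sample an action, favoring those with larger Q-values."""
    probabilities = softmax_probabilities(row, temperature)
    return rng.choices(range(len(row)), weights=probabilities, k=1)[0]
```

Because the probabilities are recomputed from the row each time, they track the Q-values as learning changes them.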


The particular calculations carried out by softmax can sometimes lead 
to the system not settling down on a good set of Q-values. An alter- 
native is the mellowmax policy, which uses slightly different math 
[Asadi17]. 


26.6.3 Putting It All Together


We can summarize the Q-learning policy and update rule in a few 
words and a diagram. 


In words, when it’s time for a move, we use the current state to find the
appropriate row of the Q-table. We then select an action from that row
according to our policy (epsilon-greedy, epsilon-soft, or softmax). We
take that action, and get back a reward and a new state. Now we want to
update our Q-value to reflect what we’ve learned from the reward. We
look at the Q-values in that new state and select the largest one. We
discount that by multiplying it by gamma, add it to the immediate
reward we just got, and blend that new value with the current Q-value
using alpha, producing a new Q-value which we save.


Figure 26.34 summarizes the process.


Figure 26.34: The Q-learning policy and update procedure. (a) When we
have a state, we look through the corresponding row of the Q-table, and
use our policy to pick an action, here shown in red. That action is
communicated to the environment. (b) The environment responds with a
reward and a new state. We find the row of the Q-table corresponding to
the new state, and select the largest Q-value there. We discount this
value by multiplying it by gamma, and then add it to the immediate reward
for this move, giving us a new value for the action we originally chose.
We blend the old value and new value using alpha, and that blended value
is placed into the original action’s cell in the Q-table.


The best values of the learning rate alpha and the discount factor 
gamma have to be found by trial and error. These factors depend inti- 
mately on the specific nature of the environment we’re working in and 
the data we’re working with. Experience and intuition often give us 
good starting points, but nothing beats traditional trial-and-error to 
find the best values for any particular learning system. 
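The update rule described in this section can be written as a short function. This is a minimal sketch, assuming the Q-table is stored as a dictionary that maps each state to a list of Q-values (one per action). The function name, the toy table, and the state labels are hypothetical, chosen only for illustration.

```python
def q_learning_update(q_table, state, action, reward, new_state,
                      alpha=0.95, gamma=0.2):
    """Apply one Q-learning update and return the new Q-value."""
    # Look ahead: the largest Q-value available in the new state.
    best_next = max(q_table[new_state])
    # Discount it by gamma and add the immediate reward.
    target = reward + gamma * best_next
    # Blend the old value with the new one using alpha.
    old_value = q_table[state][action]
    new_value = (1 - alpha) * old_value + alpha * target
    q_table[state][action] = new_value
    return new_value

# A toy table with two states and three actions each (values are made up).
q_table = {"s0": [0.0, 0.5, 0.0], "s1": [0.2, 1.0, 0.0]}
updated = q_learning_update(q_table, "s0", 1, reward=0.0, new_state="s1")
```

With these numbers the look-ahead value is 1.0, which is discounted to 0.2. Blending 5% of the old value (0.5) with 95% of 0.2 gives about 0.215.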


26.6.4 The Elephant in the Room 


Earlier we promised to return to the problem that we needed to have 
accurate Q-values in order to evaluate the update rule, but those val- 
ues themselves were computed by the update rule using the values that 
came after them, and so on. How can we use data that we haven’t cre- 
ated yet? 


Here’s the beautiful, simple answer to that problem: we ignore it.


Incredibly enough, we can start out the Q-table with all 0’s, and then
start learning. In the beginning, the system will act crazy, because
there’s nothing in the Q-table to help it pick one cell over another.
It will pick one of the cells at random and play that move. All of the
actions in the resulting state will also be 0, so the update rule, no
matter what values we use for alpha and gamma, will keep the cell’s score
at 0.


Our system will play games that look chaotic and foolish, making terri- 
ble choices and missing obvious good moves. 


But eventually, the system will stumble onto a victory. That victory 
will get a reward of a positive number, and that reward will update the 
Q-value of the action that led to it. Some time later, an action that can 
lead to that action will incorporate some of that great reward, because 
of the step in Q-learning that looks ahead to the next state. And that 
ripple effect will continue to slowly work backward through the sys- 
tem, as new games fall into the states that lead to states that previously 
led to victory. 


Note that the information isn’t actually moving backwards. Every game 
is played from beginning to end, and every update is made immedi- 
ately after each move. The information seems to move backwards 
because Q-learning involves the step of looking forward one move 
when evaluating the update rule. So the score from the next move is 
able to influence the score for this one. 
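We can watch this ripple effect in a toy setting. The sketch below is not Flippers. It assumes a hypothetical chain of states in which each state has a single action leading to the next one, and reaching the end is a victory with an assumed reward of 1. Every game is played from beginning to end, yet the non-zero values creep toward the start by one state per game.

```python
def play_chain_episode(q, alpha=0.95, gamma=0.2):
    """Play one game on a chain of states, updating after every move.

    State i has a single action that leads to state i + 1. Reaching
    the last state is a victory worth a reward of 1.
    """
    last = len(q)  # index of the terminal (winning) state
    for state in range(last):
        new_state = state + 1
        reward = 1.0 if new_state == last else 0.0
        # The terminal state has no actions, so its look-ahead value is 0.
        best_next = q[new_state] if new_state < last else 0.0
        target = reward + gamma * best_next
        q[state] = (1 - alpha) * q[state] + alpha * target

q = [0.0, 0.0, 0.0, 0.0]  # one Q-value per non-terminal state
history = []
for episode in range(4):
    play_chain_episode(q)
    # Record which states have learned something so far.
    history.append([value > 0 for value in q])
```

After the first game only the final move has a non-zero value. Each later game pushes the information one state earlier, until the first move has a value too.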


At some point, thanks to our policy that sometimes tries out new 
actions, every move will eventually lead to a path to victory, and those 
values will also influence earlier and earlier actions. Eventually the 
Q-table will fill up with values that accurately predict the rewards of 
each action. Further playing will serve to only improve the accuracy of 
those values. This process of settling into a consistent solution is called 
convergence. We say that the Q-learning algorithm converges. 


We can prove mathematically that Q-learning converges [Melo15]. 
This kind of proof guarantees that the Q-table will gradually get better. 
What we can’t say is how long that will take. The larger the table, and
the more unpredictable the environment, the longer the training process
will require. The speed of convergence also depends on the nature
of the task the system is trying to learn, the feedback provided, and of 
course our chosen values for the learning rate alpha and the discount 
factor gamma. As always, there’s no substitute for trial-and-error 
experimentation to learn the specific idiosyncrasies of any particular 
system. 


Note that the Q-learning algorithm very nicely addresses both of the 
problems we discussed earlier. 


The credit assignment problem asks us to make sure that the 
moves that lead up to a victory are rewarded, even when the environ- 
ment isn’t providing that reward. The nature of the update rule takes 
care of this, propagating the rewards for successful moves backwards 
from the final step that led to victory all the way back to the very first 
move. 


The algorithm also addresses the explore or exploit dilemma by 
using epsilon-greedy or softmax policies. They both favor choosing 
actions that have proven to be successful (exploitation), but they also 
sometimes try the other actions just to see what might come of them 
(exploration). 
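The balance between exploitation and exploration can be sketched for the epsilon-greedy case. This is one common formulation, given as an assumption: when exploring, it picks uniformly among all of the actions in the row, and when exploiting, it breaks ties at random. The function name and the example row are hypothetical.

```python
import random

def epsilon_greedy(q_row, epsilon=0.1, rng=random):
    """Usually exploit the best-known action, occasionally explore."""
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random.
        return rng.randrange(len(q_row))
    # Exploit: pick the action with the largest Q-value,
    # breaking ties at random so an all-zero row is not biased.
    largest = max(q_row)
    best = [i for i, q in enumerate(q_row) if q == largest]
    return rng.choice(best)

# With epsilon set to 0 the policy never explores.
greedy_choice = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
```

The random tie-breaking matters early in training, when the table is mostly zeros and every action in a row looks equally good.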


26.6.5 Q-learning in Action 


Let’s put Q-learning to work, and see if it can learn how to play Flippers 
in an unpredictable environment. One way to measure the algorithm’s 
performance is to have the trained model play a large number of ran- 
dom games, and see how long they take. The better the algorithm has 
gotten at finding good moves and eliminating bad ones, the fewer 
moves each game should require before reaching victory. 


A quick analysis of the game reveals that it should never take more
than 6 moves to win, with most taking 3 or 4. We'd like to see our
algorithm find those short solutions, winning every game in 6 moves or
less.


To see the effect of training on the algorithm, we'll look at plots of the
lengths of a large number of games for different amounts of training. 


Our plots show the results of playing games that start with each of the 
512 possible patterns of dots and blanks, in an environment with a 
considerable degree of unpredictability. We played 10 games for each 
starting board, for a total of 5,120 games. We cut off any game that ran 
for more than 100 steps. 


We set alpha to 0.95, so each cell retained just 5% of its old value when 
it was updated. This way we don’t completely lose what we’ve learned 
before, but we are expecting new values to be better than old ones, 
since they'll be based on improved Q-table values when they pick the 
next move. 


To select moves, we used an epsilon-greedy policy with a relatively 
high epsilon of 0.1, encouraging the algorithm to seek out new moves 1
time out of 10.


We introduced a lot of unpredictability by simulating our random 
truck coming by after each move with a probability of 1 in 10, flipping 
over a single random tile each time. To account for this, we set the dis- 
count factor gamma to 0.2. This low value says we’re only 20% sure 
that the future will play out the same way each time, because of the 
influence of those random events. We set this higher than the noise 
level we already know about (10%), because we expect that most well- 
played games will only be 3 or 4 moves long, so they are less likely to 
see a random event than a game of 10 or more moves. 
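The random truck can be simulated in a few lines. This sketch assumes the board is stored as a tuple of nine numbers (1 for a dot, 0 for a blank) and that the truck may come by after any move. The names `flip` and `noisy_move` are hypothetical.

```python
import random

def flip(board, index):
    """Return a copy of the board with one tile flipped (1 = dot, 0 = blank)."""
    flipped = list(board)
    flipped[index] = 1 - flipped[index]
    return tuple(flipped)

def noisy_move(board, action, truck_probability=0.1, rng=random):
    """Apply the player's flip, then maybe let the truck flip a random tile."""
    board = flip(board, action)
    if rng.random() < truck_probability:
        board = flip(board, rng.randrange(len(board)))
    return board

empty = (0,) * 9
quiet = noisy_move(empty, 4, truck_probability=0.0)
```

From the agent's point of view, the truck simply means that the same action in the same state does not always lead to the same new state.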


These values of alpha, gamma, and epsilon are all basically informed 
guesses. Gamma in particular was chosen based on our knowledge of 
how often random events would occur, which we rarely know ahead of 
time. In a real situation we’d experiment with our parameters to find 
what works best for this game and this amount of noise. 


Figure 26.35 shows the game lengths after training for just 300 games. 
The algorithm found a lot of quick wins with only this small amount of 
training.


[Plot: Trained on 300 games, max length=36]


Figure 26.35: The number of moves required to win each of 5120 games
(we played each of the 512 starting boards 10 times), using a Q-table that 
had been trained for 300 games. 


The “instant wins” are in the first column, corresponding to 0 moves.
These are games whose starting boards already have just three dots, 
arranged in a vertical column or horizontal row. Since there are six 
possible winning game configurations, and we ran through all the pos- 
sible board configurations 10 times each, we started with a winning 
board 60 times. 
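We can check these counts by enumerating every board. The sketch below assumes tiles are numbered 0 through 8, row by row, and uses the winning condition stated above (exactly three dots, forming a row or a column). The names are our own.

```python
from itertools import product

# Tiles are numbered 0..8, row by row, on the 3 by 3 board.
ROWS = [(0, 1, 2), (3, 4, 5), (6, 7, 8)]
COLUMNS = [(0, 3, 6), (1, 4, 7), (2, 5, 8)]

def is_win(board):
    """True when the board holds exactly three dots forming a row or column."""
    if sum(board) != 3:
        return False
    return any(all(board[i] for i in line) for line in ROWS + COLUMNS)

all_boards = list(product((0, 1), repeat=9))
winning_boards = [b for b in all_boards if is_win(b)]
print(len(all_boards), len(winning_boards), 10 * len(winning_boards))
# prints: 512 6 60
```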


Since no game in Figure 26.35 hit our 100-move cutoff, we can see that 
the algorithm never fell into a long-lived loop. A loop might just be 
two states alternating forever, or a long string of them that wraps back 
around on itself. Loops are possible in Flippers, and there’s nothing 
in the basic Q-learning algorithm that explicitly prevents the system 
from getting into a loop. 


We might say that the system “discovered” that loops don’t get to vic- 
tory and thus don’t bring any rewards, so it learned to avoid them. If at 
some point it did return to a previously-visited state, either by making
that move or as the result of a randomly-introduced flip, the relatively
high value of epsilon meant it had a good chance of eventually picking 
a new action and thereby going off in a new direction. 


Let’s raise the number of training games to 3000, as in Figure 26.36. 


[Plot: Trained on 3000 games, max length=20. Horizontal axis: Game length; vertical axis: Number of games.]


Figure 26.36: The number of games of different lengths resulting from 
playing 5120 games, based on a Q-table trained by playing 3000 games. 


The algorithm has learned a lot. The longest game is now just 20 moves, 
with most games being won in 10 moves or less. It’s nice to see the 
denser clustering around 4 and 5 moves. 


Another way to peek into the algorithm’s performance is to plot the 
values of the Q-table itself. In Figure 26.37 we show the Q-table. Each 
row corresponds to a single board configuration. In each row we show 
9 dots, one for the Q-value of each cell in that row. The horizontal posi- 
tion of the dot shows its Q-value. We can see that the algorithm has 
lots of cells that are untried (so they hold their starting value of 0), and 
most others have a very small positive value, probably because they 
contributed to a game that took a long time to win.


Figure 26.37: The Q-table after 3000 episodes. Each row corresponds to
one board. The horizontal position of each dot shows its Q-value. We can 
see many actions still have their starting value of 0, while most have been 
involved in at least one winning game returning some reward. 


Some of the values in Figure 26.37 are larger than 1. This is a natural
result of adding the immediate reward to the discounted Q-value taken
from the next state.


Let’s look at a couple of typical games played after these 3000 epi- 
sodes of training. Figure 26.38 shows a game, played left to right. The 
algorithm needed 8 moves to win.


Figure 26.38: Playing a game of Flippers after training Q-learning for
3000 episodes.


Figure 26.38 is not an encouraging result. Just by looking at the
starting board, we can see at least four different ways to win this game
in four moves. For example, flip the lower-left square and then flip the
three dots in the middle and rightmost columns. But our algorithm
seems to be flipping over tiles at random. It eventually stumbles onto
a solution, but it’s definitely not an elegant result.


Figure 26.39 shows another game that took 8 moves to win. There’s
already a row of 3 dots along the bottom of the board, so the computer
needs to only flip over the four dots in the middle and top rows. But
instead it adds dots to both empty cells, completely filling the board,
before then removing six of them one by one.


















































Figure 26.39: Another game played after 3000 training episodes.


If we train the algorithm for more episodes, we’d expect its
performance to improve. After 3000 more training episodes (for a total of
6000), we get the results of Figure 26.40.


[Plot: Trained on 6000 games, max length=18. Horizontal axis: Game length; vertical axis: Number of games.]


Figure 26.40: The lengths of our 5120 games after training the Q-table 
with 6000 games. 


Compared to our results in Figure 26.36, after 3000 games of training, 
the longest game has decreased from 20 moves to 18, and the shorter 
games of just 3 and 4 steps have become more frequent. 


The contents of the Q-table after 6000 training episodes are shown in
Figure 26.41.


Figure 26.41: The Q-table for Q-learning after 6000 episodes of training.


These charts give us an overall sense of how the algorithm is learning, 
but how is it actually performing when it plays a game? In fact, the 
algorithm has taken a huge jump in ability. 


Figure 26.42 shows the very same game as Figure 26.38, which 
required 8 moves to win. Now it takes just four moves, which is the 
minimum number for this board (though there’s more than one way to 
achieve it).


Figure 26.42: The game of Figure 26.38, solved more efficiently by
Q-learning thanks to more training episodes.


Figure 26.43 shows the same kind of improvement over Figure 26.39,
which also took 8 moves before. And once again, this is the minimum.


Figure 26.43: The game of Figure 26.39, solved more efficiently by
Q-learning thanks to more training episodes.


Q-learning has done remarkably well even in this highly unpredictable
learning environment, where a tile flipped over at random after 10%
of the moves. It weathered that unpredictability and managed to find
ideal solutions for most games, even with only 6000 training runs.


26.7 SARSA


Q-learning does a great job, but it has a flaw that can reduce the accu- 
racy of the Q-values that it relies on. 


The Q-learning update rule that we saw in Figure 26.34 uses the larg- 
est Q-value in the state that comes after taking our action. In other 
words, the update rule assumes we're going to pick that action on our 
next move, and its calculations of the new Q-value are based on that 
assumption. This isn’t a crazy assumption, because both our epsi- 
lon-greedy and softmax policies will usually pick the most rewarding 
action. But the assumption will be wrong when one of those policies 
chooses one of the other actions. 


In other words, Q-learning builds Q-values by assuming we'll always 
pick the action with the largest value when we take our next move, but 
sometimes our policies pick one of the other actions. 


When our policy picks any action other than the one we used in the 
update rule, the calculation will have used the wrong data, and we'll 
end up with reduced accuracy in the new value that we compute for 
that action. Figure 26.44 shows this problem graphically. 




Figure 26.44: The Q-learning algorithm can make a mistake when 
computing the new value for an action. (a) We calculate the Q-value for 
the current state (red) by using the best action in the new state (blue). 
(b) Our policy sometimes picks a different action when we actually get to 
that state, causing our previous calculation to be wrong. 




It would be nice to keep all the virtues of Q-learning, but avoid making 
that mistake. 


We can do that by modifying Q-learning just a little, creating a new 
algorithm known as SARSA [Rummery94]. This is an acronym for 
“state-action-reward-state-action.” The “SARS” part we’ve had covered 
ever since Figure 26.20, when we saved the starting state (S), action 
(A), reward (R), and resulting state (S). What’s new here is the extra 
action “A” at the end. 


SARSA fixes the problem of choosing the wrong cell from the next 
state by choosing that next cell with our policy (rather than just select- 
ing the biggest one), and remembering the choice of action (that’s the 
extra “A” at the end). Then when it’s time to make our new move, we 
select the action that we computed previously and saved. 


In other words, we’ve adjusted the time when we apply our
action-choosing policy. Instead of choosing our action at the start of a
move, we choose it during the previous move and remember our choice. That
lets us use the value of the action we really will use when building the
new Q-value.


Those two changes (moving the action-choosing step and remembering 
the action we chose) are all that differentiate SARSA from Q-learning, 
but they can make a big difference in learning speed. 
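The difference between the two update rules comes down to one line. The sketch below assumes the Q-table is a dictionary mapping each state to a list of Q-values, and that the blend uses alpha in the same way as for Q-learning. The function name and the toy values are hypothetical.

```python
def sarsa_update(q_table, state, action, reward, new_state, next_action,
                 alpha=0.95, gamma=0.2):
    """Apply one SARSA update using the action we will really take next."""
    # Q-learning would use max(q_table[new_state]) here instead.
    next_value = q_table[new_state][next_action]
    target = reward + gamma * next_value
    old_value = q_table[state][action]
    q_table[state][action] = (1 - alpha) * old_value + alpha * target
    return q_table[state][action]

q_table = {"s0": [0.0, 0.5, 0.0], "s1": [0.2, 1.0, 0.0]}
# Suppose the policy explored and chose action 0 in "s1", not the best action 1.
updated = sarsa_update(q_table, "s0", 1, reward=0.0,
                       new_state="s1", next_action=0)
```

Here the update uses the value 0.2 of the action that was actually chosen, giving about 0.063. Q-learning would have used the largest value, 1.0, and produced about 0.215 for the same move.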


Let’s look at three successive moves using SARSA. The first move is 
shown in Figure 26.45. Because this is the first move, we'll use our 
policy to pick an action for this move. This is the only time we do this. 
Once we have our chosen action, we use our policy to pick the action 
for move 2.


Figure 26.45: Using SARSA in the first move of our game. (a) We use our
policy to pick the current action. (b) We use our policy to pick our next 
action, and update our current Q-value with the Q-value for that next 
action. 


The second move is shown in Figure 26.46. Now we use the action we 
picked for ourselves last time, and pick the next action, which we use 
to determine the new value for the current action.


Figure 26.46: The second move using SARSA. (a) We make the action we
picked for ourselves last time. (b) We pick the next action, and use its
Q-value to update the current action’s Q-value.


The third move is shown in Figure 26.47. Here again we take the
previously determined action, and work out the action for the next, fourth
move.


Figure 26.47: The third move using SARSA. (a) We take the action we
determined during the second move. (b) We choose an action for the 
fourth move, and use its Q-value to improve the current action’s Q-value. 


Happily, we can prove that SARSA will also converge. As before, we 
can’t guarantee how long it will take, but it usually starts producing 
good results sooner than Q-learning, and improves them quickly after 
that. 
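The move-by-move pattern shown in these figures can be sketched as a loop. To keep the example self-contained it does not play Flippers. It assumes a hypothetical toy environment with three non-terminal states in a chain, where action 1 advances, action 0 stays put, and reaching the end earns an assumed reward of 1. All names are our own.

```python
import random

def choose(q_row, epsilon, rng):
    """Epsilon-greedy choice with random tie-breaking."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    largest = max(q_row)
    return rng.choice([i for i, q in enumerate(q_row) if q == largest])

def step(state, action):
    """Toy environment: action 1 advances along a chain, action 0 stays put."""
    new_state = state + action
    done = new_state == 3
    return new_state, (1.0 if done else 0.0), done

def sarsa_episode(q, rng, alpha=0.95, gamma=0.2, epsilon=0.1, max_moves=100):
    """Play one game with SARSA and return the number of moves it took."""
    state = 0
    action = choose(q[state], epsilon, rng)  # only the first move picks here
    for move in range(1, max_moves + 1):
        new_state, reward, done = step(state, action)
        # Pick the action for the *next* move now, and remember it.
        next_action = None if done else choose(q[new_state], epsilon, rng)
        next_value = 0.0 if done else q[new_state][next_action]
        target = reward + gamma * next_value
        q[state][action] = (1 - alpha) * q[state][action] + alpha * target
        if done:
            return move
        state, action = new_state, next_action
    return max_moves

rng = random.Random(0)
q = [[0.0, 0.0] for _ in range(3)]  # 3 non-terminal states, 2 actions each
lengths = [sarsa_episode(q, rng) for _ in range(200)]
```

Notice that the policy is consulted once before the loop and then once per move for the following move. The action used in the update is always the action that gets played.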


26.7.1 SARSA in Action 


Let’s see how well SARSA plays Flippers, using the same approach we 
took to Q-learning. 


Figure 26.48 shows the length of our 5120 games after 3000 training
episodes using SARSA. For this plot and those following, we continue 
to use the same parameters as for the Q-learning plots: the learning 
rate alpha is 0.95, we introduce a random flip with a probability of 0.1 
after every move, the discount factor gamma is 0.2, and we pick moves 
with an epsilon-greedy policy with epsilon set to 0.1.


[Plot: Trained on 3000 games, max length=15]


Figure 26.48: The lengths of 5120 games using SARSA after training
with 3000 games. Note that only a few games required more than the 
maximum of 6 moves. 


This is looking great, with most values clustered around 4. The longest
game is only 15 steps, with very few longer than 8.


As before, let’s plot the values of the Q-table itself. In Figure 26.49 we 
show the Q-table. Each row corresponds to one of the game’s states, 
and each dot in that row corresponds to the value of one table cell.


Figure 26.49: The Q-table for SARSA after 3000 episodes of training.


Let’s look at a couple of typical games. Figure 26.50 shows a game,
played left to right. The algorithm needed 7 moves to win.


Figure 26.50: Playing a game of Flippers after 3000 episodes of training
with SARSA.


Figure 26.51 shows another game that took 8 moves to win.


Figure 26.51: Another game played by SARSA after 3000 training episodes.


As always, more training should result in better performance. So as
before, let’s double our training to 6000 episodes. Figure 26.52 shows
the lengths of our 5120 games after 6000 training episodes.


Figure 26.52: The lengths of our 5120 games using SARSA after training
for 6000 games. Note how much shorter most of the games have become,
and that none of the games got caught in a loop.


The longest game has gone down from 15 to 14, which isn’t much to 
shout about, but the number of short games of lengths 3 and 4 is now 
even more pronounced. 


The Q-table after 6000 training episodes is shown in Figure 26.53.


Figure 26.53: The SARSA Q-table after 6000 training episodes.


These charts give us an overall sense of how the algorithm is learning, 
but how is it actually performing when it plays a game? In fact, the 
algorithm has taken a huge jump in ability. 


Figure 26.54 shows the very same game as Figure 26.50, which 
required 7 moves to win. Now it takes just 3 moves, which is the min- 
imum for this board (though again, there’s more than one way to win 
with just 3 moves).


Figure 26.54: The same game as Figure 26.50, after 3000 more training
episodes.


Figure 26.55 shows the same kind of improvement over Figure 26.51,
which also took 8 moves before. Now it took only 4.


Figure 26.55: The same game as Figure 26.51, after 3000 more training
episodes.


26.7.2 Comparing Q-learning and SARSA


Let’s compare the Q-learning and SARSA algorithms. Figure 26.56
shows the lengths of all 5120 possible games, after 6000 games of
training by Q-learning and SARSA. These results are slightly different
from the previous plots because they were generated by new runs of
the algorithm, so the random events were different.


Figure 26.56: Comparing game lengths after 6000 training games for
both Q and SARSA. SARSA's longest game was 11 steps, while Q-learning 
went as high as 18. 


They’re roughly comparable, but Q-learning produces a few games that 
are longer than SARSA’s maximum of 12. 


More training would help Q-learning, but it would help SARSA as well. 
We’ve increased the training length by a factor of 10, so we trained on 
60,000 games each. The results are shown in Figure 26.57.


Figure 26.57: The same training scenario as in Figure 26.56, but now
we’ve trained for 60,000 games.


At this level of training, SARSA is doing an excellent job on Flippers,
with almost all games coming in at 6 moves or less (a very few games 
required 7 moves). Q-learning is faring slightly worse overall, needing 
up to 16 steps to solve some of its games, but it too is greatly concen- 
trated in the region of 4 moves and under. 


Another way to compare Q-learning and SARSA for this simple game 
is to plot the average game length after increasingly long training ses- 
sions. This gives us an idea of how effectively they’re learning to win 
the game. Figure 26.58 shows this for our Flippers game.


Figure 26.58: The length of the average game for training sessions from
1 to 100,000 episodes (in increments of 1000). 


The trend here is easy to see. Both algorithms drop quickly and then 
level off, but SARSA always performs better, ultimately saving almost 
a half move on every game (that is, in general, it will play one less move 
for every 2 games).


By the time we reach 100,000 training games, both algorithms appear
to have stopped learning, or at least they've stopped improving. It 
seems likely that the Q-tables of each algorithm have settled down into 
stable states, changing a little bit over time due to the random flips 
introduced by the environment. 


To see if that’s true, let’s look at the average value in each Q-table. We’d 
expect to find that for any amount of training, SARSA will have filled 
in fewer Q-table cells, since its superior look-ahead lets it avoid com- 
puting Q-table values for bad moves more efficiently than Q-learning. 
This would give SARSA a lower average cell score, since it would be 
leaving more cells at their starting value of 0. Figure 26.59 shows these 
average values.


Figure 26.59: The average value of the Q-table for Q-learning and SARSA.
SARSA has a consistently lower average value. 


As we'd expect, the table values start with low values after the first 
1000 training games, but as time goes on the average Q-table value 
increases.


To test our explanation that SARSA has a lower average Q-table value
because it’s filling in fewer cells, let’s actually count how many cells
are not zero for each algorithm. Figure 26.60 shows these results.


Figure 26.60: The percentage of cells that are not zero in both Q-learning
and SARSA, after training for 1000 to 100,000 games in increments of 
1000. 


This confirms our intuition that SARSA is filling in fewer cells than 
Q-learning, though again it looks like it’s catching up to Q-learning 
after huge numbers of training runs, and random events cause it to 
explore more cells it had never visited before. 
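The two measurements used in these plots, the average cell value and the percentage of cells that are not zero, are easy to compute. This sketch assumes the Q-table is stored as a list of rows, and the toy table below is made up for illustration.

```python
def table_statistics(q_table):
    """Return (average cell value, percentage of cells that are not zero)."""
    cells = [value for row in q_table for value in row]
    average = sum(cells) / len(cells)
    percent_nonzero = 100.0 * sum(1 for v in cells if v != 0) / len(cells)
    return average, percent_nonzero

toy_table = [[0.0, 0.5, 0.0], [0.25, 1.0, 0.0]]
average, percent_nonzero = table_statistics(toy_table)
```

For this toy table, half of the six cells are still at their starting value of 0, and those zeros pull the average down to about 0.29.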


As these plots demonstrate, both Q-learning and SARSA do a great job 
of learning to play Flippers. SARSA has the advantage because for any
given amount of training, SARSA’s games will generally be shorter.


26.8 The Big Picture


Let’s step back and take the big view of the system we’ve built. There’s 
an environment and an agent. The environment provides the agent 
with two sets, or lists, of numbers (the state variables and the available 
actions). The agent uses these two sets, along with whatever private 
information it has internally, to select one of the values from the list of 
actions, which it returns to the environment. In response, the environ- 
ment gives the agent back a number and two new lists. 


That’s really all that’s going on: small exchanges of numbers. 


Interpreting the lists as boards and moves was great, because it lets us 
think of Q-learning in terms of learning to play a game. But the agent 
doesn’t know it’s in a game, or that there are rules, or really much of 
anything. It just knows that two lists of numbers come in, it picks a
value from one of the lists, and then a reward value and a couple of new
lists arrive in response. It’s remarkable that this little process could 
do much that’s interesting at all, but if we can find a way to describe 
our environment, and actions on that environment, using sets of num- 
bers, and we can find even the crudest way to distinguish a good action 
from a bad one, this algorithm can learn how to perform high-quality 
actions. 


This worked for our simple game of Flippers, but how practical is all of
this Q-table stuff? In Flippers, there are 9 squares and each
can have a dot or not, so the game needs a Q-table with 512 rows and
9 columns, or 4608 cells. In a game of tic-tac-toe, there are 9 squares,
and each can have one of 3 symbols: blank, X, or O. The Q-table
for this game would need about 20,000 rows and 9 columns, or about
180,000 cells.


That’s big, but not ridiculously big for a modern computer. But what
if we want a slightly more challenging game? Rather than play tic-tac-toe
on a 3 by 3 board, we'll play on a 4 by 4 board. There are a bit more
than 43 million such boards, so our table would have 43 million rows
and 16 columns, or a bit under 690 million cells. That’s getting pretty
big, even for modern computers. Let’s increase it just one more modest
step, and play tic-tac-toe on a 5 by 5 board. That hardly seems
outrageous. Yet that board has almost 850 billion states. If we get a little
ambitious and play on a 13 by 13 board, we'll find that the number of
states is more than the number of atoms in the universe [Villanueva15].
In fact, it’s roughly the number of atoms in one billion universes.


Storing the table for this game is obviously impossible, but it’s an 
entirely reasonable thing to want to do. Even more reasonably, we 
might want to play Go. The standard board for the game of Go has 
a grid of 19 by 19 intersections, and each intersection can be empty, 
have a black stone, or a white stone. So this is like our tic-tac-toe board, 
but unfathomably bigger. We’d need a table whose rows would have 
labels requiring 173 digits. Such numbers are not just wildly impracti- 
cal, they’re incomprehensible. 
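These table sizes come from simple counting, which we can reproduce directly. The sketch below counts every way of filling a board, without checking whether a position could arise in a real game. The function name is hypothetical.

```python
def board_states(side, symbols=3):
    """Number of ways to fill a side-by-side board when each cell
    can hold one of `symbols` values."""
    return symbols ** (side * side)

print(board_states(3, symbols=2))  # Flippers: 512
print(board_states(3))             # 3 by 3 tic-tac-toe: 19683
print(board_states(4))             # 4 by 4: 43046721
print(board_states(5))             # 5 by 5: 847288609443
print(len(str(board_states(13))))  # 13 by 13: a number with 81 digits
print(len(str(board_states(19))))  # Go: a number with 173 digits
```

Each extra row and column multiplies the count by a large power of 3, which is why the table becomes impossible to store so quickly.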


Yet this is the basic strategy that was used by the DeepMind team to
build AlphaGo, which famously beat a world champion human player
[DeepMind16]. They did it by combining reinforcement learning with 
deep learning. 


One of the key insights in this deep reinforcement learning 
approach was to eliminate explicit storage of the Q-table. We can think 
of the table as a function that takes a board state as input, and returns 
a move number and Q-value as output. As we’ve seen, neural networks 
are great at learning how to predict things like this. 


So we can build a deep learning system that takes the same board 
input, and predicts the move number and Q-value we'd get if we really did
keep the table around. With enough training, this network can become 
accurate enough that we can abandon the Q-table and use just the 
network. 
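

As a rough illustration of a function standing in for the table, here is a minimal sketch. It is not a deep network: TinyQNet is a hypothetical one-layer linear model, and the board encoding (+1 for our marks, -1 for the opponent's, 0 for empty) is an assumption. It only shows the shape of the idea, where a board goes in, a move number and Q-value come out, and training nudges the prediction toward the value the table would have held.

```python
import random

class TinyQNet:
    """Hypothetical stand-in for a deep network: one linear unit per move."""

    def __init__(self, n_cells=9, n_moves=9, seed=0):
        rng = random.Random(seed)
        self.w = [[rng.uniform(-0.01, 0.01) for _ in range(n_cells)]
                  for _ in range(n_moves)]
        self.b = [0.0] * n_moves

    def q_values(self, board):
        """Predict one Q-value per move for a board of +1, -1, and 0 cells."""
        return [sum(wi * x for wi, x in zip(w, board)) + b
                for w, b in zip(self.w, self.b)]

    def best_move(self, board):
        """Return (move number, Q-value), like looking up a table row."""
        q = self.q_values(board)
        move = max(range(len(q)), key=q.__getitem__)
        return move, q[move]

    def train_step(self, board, move, target_q, lr=0.05):
        """Nudge one move's prediction toward the value the table would hold."""
        error = target_q - self.q_values(board)[move]
        for i, x in enumerate(board):
            self.w[move][i] += lr * error * x
        self.b[move] += lr * error
        return error

net = TinyQNet()
board = [1, -1, 0, 0, 0, 0, 0, 0, 0]
for _ in range(100):
    net.train_step(board, move=4, target_q=1.0)
move, q = net.best_move(board)
print(move, round(q, 3))
```

The memory cost here depends on the number of weights, not on the number of possible boards, which is the property that lets a network replace a table too large to store.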


Training a system like this can be challenging, but it can be done, with 
excellent results [Mnih13] [Matiisen15]. 


26.9 Experience Replay


Even with epsilon-greedy policies and random environmental events, 
reinforcement learning can get stuck in ruts. It can also gradually for-
get useful strategies if they haven’t been needed for a while, requiring
them to be rediscovered later.


One way to reduce these problems is to use experience replay 
[Wawrzynski13]. 


The idea of experience replay is like that of a musician who returns to 
playing an instrument after years of not touching it. She’d probably 
have to go back to her early exercises and simple pieces. She wouldn’t 
be learning them for the first time, but the process of going through 
them again would re-kindle old habits of muscle and mind, helping 
her remember dormant skills. 


To apply this idea to reinforcement learning, we save sequences of 
states and actions that we want the system to remember. Then during 
training, when we come to the starting state of one of these sequences, 
we “play it back” to the learning system. 


The idea is that we temporarily disconnect the environment, and 
replace it with the sequence of feedback messages that we saved. 


We ask the system to make a choice, and then we over-write this with 
the choice it selected when we made the recording. We then give the 
system the same feedback (reward, new actions, and new state) that 
it got the last time. The system will update its Q-values based on this 
action, and then we ask it to make a new choice. We over-write that 
with the next saved choice in the recording, and repeat the process 
above. We do this for every action in the recording. 


When the recording is over, we reconnect the environment to the agent, 
and let them continue as usual. 


We might think of the whole thing as forcing the agent to re-live an
old memory, except that it will react to those memories as though they 
were real. 


The upshot is that each remembered action will experience a little 
boost to its Q-value, making it a more likely contender to be selected 
in the future. 
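

The playback described above can be sketched as follows. This assumes the standard Q-learning update with a learning rate alpha and a discount gamma; the state names, action names, and the replay function are all hypothetical. Each recorded step holds the state, the action that was chosen, and the feedback that came back (reward, new state, and new actions).

```python
from collections import defaultdict

def replay(q, recording, alpha=0.1, gamma=0.9):
    """Feed a saved sequence back to the learner, updating Q-values in place."""
    for state, action, reward, next_state, next_actions in recording:
        # Best value available from the state the recording moved to
        best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
        target = reward + gamma * best_next
        q[(state, action)] += alpha * (target - q[(state, action)])
    return q

q = defaultdict(float)
recording = [
    ("s0", "right", 0.0, "s1", ["left", "right"]),
    ("s1", "right", 1.0, "end", []),
]
for _ in range(3):          # re-live the same memory three times
    replay(q, recording)
print(q[("s0", "right")], q[("s1", "right")])
```

Each pass raises the Q-values of the recorded actions a little, and the reward at the end of the recording gradually propagates back to the earlier step.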


Adding experience replay to a reinforcement-learning system can help 
it “remember” what it has learned, even if periods go by when that 
knowledge is not used. Instead of fading away, the successful actions 
have their scores bumped up a bit, so they remain attractive choices in 
the future. 


26.10 Two Applications 


Let’s briefly look at two applications of reinforcement learning. 


The first considers the game of Go. As we mentioned before, this is a 
game with an awe-inspiring number of states. Although chess-playing 
programs were a focus for many years, Go has long been considered a
harder challenge [Levinovitz14]. 


The AlphaGo team used deep reinforcement learning to build a world- 
class player [DeepMind16]. That system incorporated a lot of human 
knowledge. A second system, called AlphaGo Zero, was built to learn 
Go from scratch, with no human games, experience, or other guidance, 
except the rules and a drive to win [Hassabis17]. 


The approach again used reinforcement learning, this time for both 
sides of every game. Recall that to an agent, everything else in the 
world is part of the environment. So when it’s player 1’s turn, player 
1 is the agent and makes a move based on the environment, including 
player 2. And when it’s player 2’s turn, then player 1 becomes part of 
the environment. By taking each player’s point of view alternately in 
each game, a system can play itself. 
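

The alternating point of view can be sketched with a toy game. This is not how AlphaGo Zero is built; it uses tic-tac-toe and a hypothetical random_policy in place of a learned one. The board is always stored from the current mover's view (+1 for the mover's marks, -1 for the other player's), so a single policy can play both sides.

```python
import random

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def mover_wins(board):
    """True if the marks stored as +1 fill any row, column, or diagonal."""
    return any(all(board[i] == 1 for i in line) for line in LINES)

def random_policy(board, rng):
    """Hypothetical stand-in for a learned policy: pick any empty cell."""
    return rng.choice([i for i, c in enumerate(board) if c == 0])

def self_play(policy, seed=0):
    """Play one game with a single policy; return (history, winner or 0)."""
    rng = random.Random(seed)
    board = [0] * 9
    history = []
    player = 1
    while 0 in board:
        move = policy(board, rng)
        history.append((player, move))
        board[move] = 1
        if mover_wins(board):
            return history, player
        # Swap viewpoints: the player who just moved becomes
        # part of the environment for the other player.
        board = [-c for c in board]
        player = 3 - player
    return history, 0

history, result = self_play(random_policy, seed=1)
print(history, result)
```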


Since there’s no way that any practical hardware available today could
store all the possible states in Go, it cannot use our strategy above of 
building a table that remembers the quality of every move for every 
board state. Instead, while the program learns to play, it also trains 
a neural network that takes in the board state and helps to select the 
best move. That network takes the place of a comprehensive table. 


This approach was at the heart of the training for AlphaGo Zero. After 
playing almost 5 million games against itself, AlphaGo Zero is now 
arguably the best Go player the world has ever seen [Silver17]. 


Another application of reinforcement learning has nothing to do with 
games. 


The field of computer graphics is all about making pictures using com- 
puter programs. Often these pictures appear to capture 3D scenes. In 
some cases, we want these images to be accurate predictions of the 
real world, such as when we’re planning a new building or garden. In 
other cases, we’re making more fanciful images, such as simplified or 
abstracted worlds for games or animated films. 


In all of these applications, getting an image to look like a plausible 3D 
scene involves a wide variety of techniques from diverse disciplines. 
But somewhere in that process we almost always need to determine 
the quantity and color of light that is leaving one part of the scene and 
arriving at another. It’s by following the light as it bounces around the 
environment that we can ultimately determine the light arriving at a 
simulated camera, enabling us to create the image [Glassner94].


Working out this distribution of light is a computationally intensive 
task, requiring us to consider not just the properties of the media and 
surfaces in the scene, but where things are in space. For example, if 
there’s a completely opaque object between two points, then no light
can pass directly from one to the other (well, unless the object is wear- 
ing an invisibility cloak [Vandervelde16]). We typically must evaluate 
many millions of these light transfers to make an image. Often we 
model this transfer of light by imagining a ray of light from one point
to another, and computing a description of the light flowing along that 
ray. This technique is called ray tracing [Glassner89]. 


Generally speaking, the more of these rays we evaluate, the better our 
pictures become (they look smoother and more like photos, rather 
than covered in snow-like speckles). But because evaluating this move- 
ment of light is so time-consuming, we always want to perform this 
step as few times as possible. And we'd like to prioritize these calcula- 
tions so we create rays that look in directions that are most important 
to estimating the incoming light. This usually means we want to look 
in those directions where the arriving light is brightest, or carries the 
most color information. 


A seminal paper in computer graphics demonstrated that the math- 
ematics developed by physicists to describe the motion of subatomic 
particles could also be used to describe the way light moves around an 
environment [Kajiya86]. 


Remarkably, this equation for computer graphics has almost the same 
structure as the basic equation that describes Q-learning [Dahm17a]. 
This is more than a coincidence. It suggests that there are deep con- 
ceptual similarities between these two seemingly unrelated activities. 


Placing the two equations side by side, we can work out the correspon-
dence between reinforcement learning ideas and computer graphics
ideas. A state in Q-learning corresponds to a chosen point in the scene,
and an action is the process of computing the light flowing along a ray
from some other specific point back to our chosen point. The reward is
the description of that received light.
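

For readers who want to see the structural similarity, the two equations can be written in common textbook notation. The text shows neither equation, so the symbols here are assumptions: x is a point in the scene, the omegas are directions, y is the point seen from x in direction omega_i, s and a are a state and an action, s' is the state reached by taking a in s, and pi is a policy that weights the next actions.

```latex
% Light leaving point x in direction \omega_o
L(x, \omega_o) = L_e(x, \omega_o)
  + \int_{\Omega} f_r(x, \omega_i, \omega_o)\, \cos\theta_i\,
    L(y, -\omega_i)\, d\omega_i

% Value of taking action a in state s, then choosing actions with \pi
Q(s, a) = r(s, a)
  + \gamma \int_{A} \pi(a' \mid s')\, Q(s', a')\, da'
```

Both define a quantity as a local term plus a weighted integral of the same quantity evaluated at the next location. Q-learning in its usual form replaces the weighted integral over next actions with a maximum over them.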


So now when we create an image, we can focus our efforts on finding 
the light that is most important for this picture. This use of Q-learning 
makes our process more efficient, since we’re not wasting time eval- 
uating incoming light that won’t make a big difference in this image. 
Rather than making the best move from a given board state to get the 
best reward, we get to look in the best direction from a given point in
the scene to get the most information about the light arriving at that 
point. 


There is more work to be done once we've collected the incoming light, 
but this gathering of light information is the essential first step in cre- 
ating an image. Because evaluating this light is such a slow process, 
and it must be done so many times when creating an image, increases 
in efficiency like those offered by this analogy to Q-learning let us cre- 
ate good-looking images significantly faster than before [Dahm17b]. 


Chapter 27: Generative Adversarial Networks


One way to teach a system about the character of a data set is to pair
it with another system that’s trying to trick it. When our learner can
distinguish real data from forgeries, we can use it to make more data
like the input.


27.1 Why This Chapter Is Here


Generating data is exciting. It lets us produce new paintings, songs, 
and sculptures that have a resemblance to their inputs. 


We saw how to generate new data with a VAE in Chapter 25. In this 
chapter we’re going to look at a completely different approach to gen- 
erating data that is like the training data. 


This technique lets us generate types of data that go far beyond what 
a VAE offers. One fun application lets us create specific types of new 
media based on existing examples of different, but similar, media. For 
example, we can answer the questions, “What if van Gogh had painted
his version of this photograph of the Grand Canyon?” [Zhu17], or “What
if Bach had written 1960s pop music?” [Bayless93], or “What if psy-
chedelic rock groups performed their songs as reggae?” [EasyStars03].


These are interesting questions, but they have all been well answered 
by humans, who infused their replies with greater depth and wit than 
a purely statistical forgery would offer. 


But if we can generate new data at will, it can help us train other neural 
networks. We’ve seen that big networks require big data, and getting 
high-quality, labeled data is difficult. If we’re generating that data our- 
selves, we can make as much of it as we like. 


Another use of generating new data is to give us ideas and options. 
Suppose we're planning a formal garden. We can give the computer 
the space we have available, and its location. From this, the computer 
can look up the local climate, giving it a rough idea of the types of sun 
and rain to be expected. Now we can give it pictures of other formal 
gardens that we like, and it can produce an endless parade of possible 
gardens. We might not love any one of them, but they could stimulate 
new ideas and serve as starting places for our own creativity. 


Or suppose we want help selecting new furniture for our house. A 
trained system could be given pictures of the furniture currently in our 
home, and then it could synthesize a new chair or couch that would 
fit our decor. That design could then be given to a craftsman to start 
from. Note that this isn’t merely a selection of an existing photo, but 
an entirely new piece of furniture synthetically created to aesthetically
match the other pieces.


The type of system we'll look at is called a Generative Adversarial 
Network, or GAN. It’s based on a clever strategy where two different 
deep networks are pitted against one another, with the goal of getting 
one network to create new samples that are different from the training 
data, but are so close that the other network can’t tell which are syn- 
thetic and which belong to the original training set. 


Once we've trained a GAN on a dataset, whether it’s on pictures 
of flower gardens, pieces of music, machine parts, novels, or more 
abstract data, we can then make as much new data as we like. In the 
purest form of the idea, the new samples will ideally be indistinguish- 
able from the training data. That is, given any sample, we won't be 
able to tell if it was one of the input samples or something created by 
the generator. 


In many cases, we can even smoothly blend from one sample to another, 
creating in-between samples that (in the best case) are also indistin- 
guishable from the input data. 


27.2 A Metaphor: Forging Money


The usual way to introduce a GAN is to imagine a counterfeiting oper- 
ation. We'll present a variation on the typical presentation to better 
expose the key ideas. 


The story begins with two conspirators, whom we'll call Glenn and 
Dawn. Glenn’s name starts with G because he'll be playing the role of 
the generator, in this case forging new money. Dawn’s name begins 
with D because she'll be our discriminator, tasked with determining 
whether any given bill is real, or one of Glenn’s forgeries. They’re going 
to work together so that they both become as good as possible at their 
jobs, in the process forcing each other to become better and better. 


As the generator, Glenn sits in a back room all day, meticulously 
creating metal plates and printing false currency. Dawn is the quali- 
ty-control half of the operation. It’s her job to take a mixed-up pile of 
real bills along with Glenn’s forgeries, and decide which is which. 


The penalty for forgeries in their country is life in prison, so they’re 
both highly motivated to produce bills that nobody can tell apart from 
the real thing. Let’s say that the currency of their country is called the 
Solar, and they want to counterfeit the 10,000 Solar bill. 


An important thing to note is that not all 10,000 Solar bills are the
same. At the very least, each bill has a unique serial number. But real
bills are also scuffed, folded, drawn on, torn, dirtied, and otherwise 
handled. Since new, crisp bills stand out, Glenn and Dawn want to 
produce currency that looks just like all this other, worn currency, so it 
blends in and doesn’t catch anyone’s eye. 


In a real situation, Glenn and Dawn would surely start off with a huge 
stack of real bills, and pore over every detail, learning everything they 
could. But we’re just using their operation as a metaphor, so we're 
going to put in some restrictions to make this situation better match 
the algorithms this chapter is dedicated to. 


First, we'll simplify things a little and say that we only care about one
side of the bill. Maybe the back is blank, so people can keep shopping 
lists and other notes on their money. 


Second, we’re not going to give Glenn and Dawn each a stack of bills 
to study before they begin. In fact, we'll give them almost nothing. So 
when they start producing counterfeit bills, neither Dawn nor Glenn
has any idea what a real bill looks like. Clearly this is going to make
things a lot harder. We'll justify this in a moment. The one thing we 
will give them is for Glenn: a big stack of blank rectangles of paper that 
are the right shape and size for a 10,000 Solar bill. 


We will give them both a daily routine. Each day, Glenn will sit down 
and make a few forgeries, using all the information he has so far. In 
the beginning, he doesn’t know anything, so he might just splash dif- 
ferent colors of inks around on the paper. Or maybe he'll draw some 
faces or numbers. He’s basically just drawing random stuff. 


Near the end of each day, Dawn will go to the bank and withdraw a 
stack of 10,000 Solar notes. Very lightly she'll write the word “Real” 
on the back of each one in pencil. Then she'll collect Glenn’s forgeries 
for the day, and she'll write the word “Fake” lightly on the back of each. 
Then she'll shuffle them together. Figure 27.1 shows the idea. 


Figure 27.1: Dawn’s work in detecting forgeries. Her first step is to get a
set of real bills from the bank. On the back of each she lightly writes the
word “real.” Then she collects the day’s work from her forging partner,
Glenn. On the backs of those she lightly writes “fake.” Then she shuffles 
the bills together, and looking only at the front side, classifies each one 
as “real” or “fake.” In this example, she’s made some mistakes in both 
categories. 


Now Dawn does her real work. One by one she goes through the bills, 
and without looking at the backs, categorizes each one as real or fake. 
Let’s say she’s asking herself, “Is this bill real?” Then an answer of “yes” 
can be called a positive response to that bill, and an answer of “no” would
be a negative response to that bill. 


Dawn carefully sorts her starting stack into two piles: the reals and the 
fakes. Since each bill could be real or fake, there are four possibilities, 
summarized in Figure 27.2. 


                      Actual status: Real        Actual status: Fake

Dawn’s decision:      True Positive              False Positive
Real                  A real bill that was       A successful forgery.
                      correctly recognized.      Dawn studies it to
                                                 find any errors.

Dawn’s decision:      False Negative             True Negative
Fake                  An unrecognized real       A failed forgery. Glenn
                      bill. Dawn studies it to   studies it to discover
                      learn about real bills.    his mistake.


Figure 27.2: When Dawn examines a bill, it might be real or fake, and she 
might declare it to be real or fake. This gives us four combinations. 


When Dawn looks at a bill, if it’s real and she says it’s real, then her 
“positive” decision (it’s real) is accurate, and we have a true positive 
(TP). If the bill is real but her decision is “negative” (she thinks it’s 
fake), then it’s a false negative (FN). If the bill is fake but she thinks 
it’s real, that’s a false positive (FP). Finally, if it’s fake and she cor- 
rectly identifies it as fake, that’s a true negative (TN). In all cases but 
true positive, either Dawn or Glenn uses that example to improve their 
work. 
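

The four combinations in Figure 27.2 can be written as a small lookup. This is only a restatement of the figure in code; the names outcome and WHO_LEARNS are made up for this sketch.

```python
def outcome(bill_is_real, dawn_says_real):
    """Name the cell of Figure 27.2 for one bill."""
    if dawn_says_real:
        return "TP" if bill_is_real else "FP"
    return "FN" if bill_is_real else "TN"

# Who studies the bill afterward, following the figure
WHO_LEARNS = {"TP": None, "FP": "Dawn", "FN": "Dawn", "TN": "Glenn"}

# One bill of each kind: (actually real?, Dawn says real?)
pile = [(True, True), (True, False), (False, True), (False, False)]
results = [outcome(real, says_real) for real, says_real in pile]
for name in results:
    print(name, WHO_LEARNS[name])
```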


Figure 27.2 looks a lot like the operant conditioning matrix we saw in 
Chapter 11. What’s interesting here is that there is no reward for the 
true positive. Instead, Dawn is “punished” for assigning false positives 
and false negatives, by making her study her incorrect bills, and Glenn 
is “punished” for producing bills that are evaluated as true negatives 
by making him study the detected bills in order to improve his work. 


Once Dawn has categorized each bill, she goes back through the new 
piles and checks her work. 


Let’s say she decides to first go through the stack of bills that she cat-
egorized as “real.” If the back of one of these bills says “real,” then it’s 
a true positive, because Dawn classified it as positive (she answered 
“yes” to “Is this bill real?”), and she got it right. She can feel good about 
her skills and move on to the next bill. 


Otherwise, it’s a false positive, because she called it real when it 
wasn't. Each false positive is a chance for her to learn more about these 
bills. Since Dawn is the quality inspector, she needs to figure out what 
clue she missed that would have let her detect the forgery. It’s essen- 
tial for Dawn to be able to correctly spot a fake, because if she’s fooled 
and the police aren’t, the team could land in jail. 


When the pile of bills that she classified as “real” has been handled, 
she checks her decisions on the bills she classified as “fake.” If the back 
of one of these bills says “real,” then it’s a false negative, since she 
called it fake when it wasn’t. This is a problem for the team, because it
suggests that Dawn needs to improve her ability to tell real bills from
fake ones.


Otherwise, the bill is a forgery and Dawn correctly identified it as such. 
It’s a true negative. Now it’s time for Glenn to learn what he did
wrong that gave away the fact that the bill was a forgery, and try not to
repeat it.


27.2.1 Learning from Experience 


We've said that both Dawn and Glenn need to learn from their mis- 
takes, but how? 


Let’s not answer that as if Dawn and Glenn were people, but as if they 
were two different neural networks. 


Dawn’s categorizations are the result of a neural network that classifies 
each input into one of two categories: real or fake. When the prediction 
is wrong, that network’s error function will have a large value. As we 
saw in Chapter 18, backpropagation will use this error to start driving
the error gradient through Dawn’s network, adjusting weights, so that 
the category is more likely to be right the next time. 


Glenn’s job is a bit more complicated. When Dawn thinks one of his 
forgeries is a real bill, his network doesn’t need to change a thing, 
because he succeeded at his job. But when Dawn catches the forgery, 
we'll communicate this to Glenn’s network by placing a big error at 
the end of his network. Now backprop goes into action on Glenn’s net- 
work, and adjusts the weights so that new outputs are less likely to be 
caught by Dawn. 
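

To make “a large error when the prediction is wrong” concrete, here is a minimal sketch. The text does not name an error function, so binary cross-entropy is an assumption here, and both function names are hypothetical. Dawn's network is assumed to output a probability that the bill is real. Unlike the all-or-nothing description above, this smooth error is small rather than exactly zero when the network is right.

```python
import math

def discriminator_error(p_real, bill_is_real):
    """Binary cross-entropy for Dawn's prediction p_real = P(bill is real)."""
    p = min(max(p_real, 1e-12), 1.0 - 1e-12)   # avoid log(0)
    return -math.log(p) if bill_is_real else -math.log(1.0 - p)

def generator_error(p_real_for_fake):
    """Glenn's error on one forgery: small if Dawn is fooled, large if not."""
    p = min(max(p_real_for_fake, 1e-12), 1.0)
    return -math.log(p)

# Dawn looks at a real bill: a confident right answer versus a wrong one
print(discriminator_error(0.9, True), discriminator_error(0.1, True))
# Glenn's forgery: Dawn fooled versus Dawn catching it
print(generator_error(0.95), generator_error(0.05))
```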


How does Glenn’s network improve, and produce fakes that fool 
Dawn? By slow, gradual progress. Glenn’s network adjusts a little bit 
after each detected forgery. Over time, the good changes will accumu- 
late and Glenn’s output will be harder and harder to distinguish from 
the real bills. 


It’s natural to consider this a ridiculously clumsy way for Glenn to learn. 
Why this blind trial and error? Why not show Glenn (or his network) 
some real bills, and tell it to learn directly from them? This worked for 
variational autoencoders, and as we saw in Chapter 25, they produced
output that looked a lot like the input. 


We don’t do that here because there are lots of other important prob- 
lems that cannot be solved with that approach. 


For example, let’s say that Glenn is trying to forge expensive wines by 
mixing together various percentages of a wide variety of ingredients. 
Glenn might not have the technology or ability to reverse engineer the 
chemical composition of the wines he’s trying to fake. And even if he 
can, he might not know how to mix and prepare a multitude of starting 
ingredients to produce those specific results. They might require heat- 
ing, or cooling, or aging, or other processes Glenn couldn’t even guess 
at. Instead, he can try very many different combinations and gradu- 
ally discover the right choices that will fool Dawn, now an increasingly 
expert wine taster, more and more frequently. 


Sometimes we can’t reverse engineer the training data at all. Let’s
suppose that Glenn is trying to forge integrated circuits for a piece of 
electronics gear. The circuits he’s trying to fake are sealed in epoxy 
and other materials that make them all but impossible to pick apart. 
Instead, he has to try to cobble together circuits that perform the same 
way, without any knowledge of what he’s faking. 


In another scenario, Glenn might be trying to forge paintings in the 
style of a famous artist. He’s given photographs of various scenes, and
he wants to produce an image of each scene that even an art expert
would be fooled into thinking was a lost painting from the master. 
How would Glenn reverse engineer the painting style he’s faking? Of 
course he could work for a long time trying to enumerate the choices 
of brushes and colors and paint strokes and so on, as traditional forg- 
ers have done. But that requires long and meticulous study. It’s far 
easier to run a program that tries out lots of stylistic modifications to 
a photograph, and then compare those to real paintings by the master. 
If Dawn, now in a role as an art historian, can’t tell Glenn’s forgeries 
from real paintings, he’s accomplished his task. 


The difference between Glenn studying originals in order to reverse 
engineer them, versus gradually learning ways to fake those originals, 
is really just a repeat of the fundamental split between designing hand- 
tuned features and using the computer to find features for us. 


Recall from our discussion of feature engineering in Chapter 1 that early 
approaches to identify hand-written digits used features designed by 
humans to look for the two loops of an 8, or the horizontal and angled 
lines of a 7, and so on. Such lists of features were hard to build when 
the problem got complex, and rapidly got swamped with exceptions 
and variations. The purely statistical approach taken by machine learn- 
ing didn’t bother with explicitly noting features in this way. Rather, it 
just analyzed what was happening in the pixels statistically, and used 
those statistics to identify the digit. It proved to be much faster, more
accurate, and more robust than hand-built lists of features.


27.2.2 Forging with Neural Networks


Feeling like they’re overdue for a vacation, Dawn and Glenn decide to 
forge some plane tickets to an exotic paradise, and hand their jobs over 
to a pair of neural networks. 


Dawn will replace herself with a network that she'll call a discrim- 
inator. This network can use any kinds of layers that we want. The 
discriminator in our example takes an image as an input, and returns 
a single value as its output, describing whether the image is that of a 
real or fake bill. We can think of the discriminator as a binary classi- 
fier, with the two classes “real” and “counterfeit.” 


Glenn will replace himself with a network that he'll call a generator. 
This, too, can have any kind of architecture. The job of the generator 
is to produce new, counterfeit banknotes. But as we saw, real currency 
isn’t all identical, so the generator’s output will be an unlimited stream 
of new and unique images that are different from one another, but all 
indistinguishable from real bills. 
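

The two roles can be sketched with deliberately tiny stand-ins. These are not image networks: each “bill” is a single number, the Discriminator is a one-weight logistic classifier, and the Generator is a straight-line function of a random input. All names and parameters are hypothetical, and feeding the generator random numbers to get a stream of different outputs is an assumption at this point in the text.

```python
import math
import random

class Discriminator:
    """Toy binary classifier: one number in, P(real) out."""
    def __init__(self, w=1.0, c=0.0):
        self.w, self.c = w, c

    def p_real(self, x):
        return 1.0 / (1.0 + math.exp(-(self.w * x + self.c)))

    def says_real(self, x):
        return self.p_real(x) >= 0.5

class Generator:
    """Toy forger: turns a random number into a new sample."""
    def __init__(self, a=1.0, b=0.0):
        self.a, self.b = a, b

    def make(self, z):
        return self.a * z + self.b

rng = random.Random(0)
G, D = Generator(), Discriminator()
fakes = [G.make(rng.gauss(0.0, 1.0)) for _ in range(5)]
verdicts = [D.says_real(x) for x in fakes]
print(list(zip(fakes, verdicts)))
```

A real system would swap in deep networks for both classes, but the interfaces stay the same: the generator maps random values to a sample, and the discriminator maps a sample to a single real-or-fake score.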


In Figure 27.2 we identified Dawn’s four types of decisions in terms 
of being true or false, and positive or negative. Let’s illustrate these in 
flowchart form, in terms of the discriminator and generator. 


Starting with the true positive, the discriminator correctly reports that 
the image of a real bill at its input is, indeed, a real bill. Since this is 
just what we want the discriminator to do in this case, there’s no learn- 
ing to be done. Figure 27.3 shows this process graphically. 


[Figure 27.3 diagram: an image of a real bill, with the label “Real,” goes
into D. The prediction, “Real,” matches the label, so no learning is needed.]


Figure 27.3: In the true positive (TP) case, the discriminator (D) receives 
a real bill and correctly predicts it to be real. Nothing needs to happen as 
a result. 


Next we have the false negative, when the discriminator incorrectly 
declares a real bill to be a fake. As a result, the discriminator needs to 
learn more about real bills so it doesn’t repeat this error. Figure 27.4 
shows the situation. 


[Figure 27.4 diagram: an image of a real bill, with the label “Real,” goes
into D. D learns more about real bills.]


Figure 27.4: We get a false negative (FN) when the bill is real but the
discriminator says it’s a fake. In this case, the discriminator
needs to learn more about real bills so it doesn’t repeat this mistake.


The false positive case comes when the discriminator gets fooled by 
the generator, and declares a forged bill to be real. In that case, the dis- 
criminator needs to study the bill more carefully and find any errors or 
inaccuracies so that it won’t get fooled again. Figure 27.5 shows how 
this goes. 


Figure 27.5: In the false positive (FP) situation, the discriminator receives
a fake bill from the generator, but classifies it as real. This means that the 
generator has created a convincing forgery. To force the generator to get 
even better, the discriminator learns from its mistake so that this partic- 
ular forgery won't sneak through again. 


Finally, the true negative case is when the discriminator correctly iden-
tifies a forgery. In this case, shown in Figure 27.6, the generator needs
to learn what it did wrong and improve its output.


[Figure 27.6 diagram: G learns what was wrong so it doesn’t get repeated.]


Figure 27.6: In the true negative (TN) scenario, we give the discriminator 
a fake bill from the generator, and the discriminator correctly identifies 
it as fake. In this case, the generator learns that its output was not good 
enough, and it has to improve its forging skills. 


Note that out of these four possibilities, one of them (TP) has no effect 
on either network, two of them (FN and FP) cause the discriminator to 
improve its ability to recognize real and fake bills, and only one (TN) 
causes the generator to learn and avoid repeating mistakes. 


27.2.3 A Learning Round


We can use the feedback loops from the last section to drive our train- 
ing process. 


Generally, we'll repeat a set of four steps over and over. In each step, 
we'll give the discriminator either a real or fake bill, and then take the 
correct learning action based on its decision. 


First we train the discriminator, then the generator, then the discrim- 
inator again, and then the generator again. The idea is to test for each 
of the three situations where one or the other network needs to learn. 
The true negative case, where the generator learns, is run twice
for reasons we'll get to in a moment. Figure 27.7 summarizes the four 
steps. 


Figure 27.7: The four steps of a learning round. (a) We give a real bill to
the discriminator. If it classifies it as fake, then we have a false negative, 
and we need to teach the discriminator to better recognize real bills. (b) 
We generate a fake bill. If the discriminator labels it as a fake, the gener- 
ator has to become cleverer. (c) We generate a fake bill. If the discrimi- 
nator labels it as real, the discriminator has to learn how not to be fooled. 
(d) Step b repeated, so that both the discriminator and generator learn 
at roughly equal rates. 


First, we try to learn from false negatives. We give the discriminator a
random bill from the dataset of real bills. If it misclassifies it as a forg-
ery, we tell the discriminator to learn from that mistake.


Second, we look for true negatives. We give some random numbers to
the generator, produce a fake bill, and hand that to the discriminator. 
If the discriminator catches the forgery, we tell the generator, which 
attempts to learn to produce a better forgery. 


Third, we look for false positives. We give a new batch of random val- 
ues to the generator and have it produce a new, fake bill, which we 
hand to the discriminator. If the discriminator is fooled and says the 
bill is real, the discriminator learns from its mistake. 


Finally, we repeat the true negative test from the second step. We give 
new random numbers to the generator, make a new fake bill, and if 
the discriminator catches the forgery, the generator learns. 


The reason for running the generator’s learning step twice is that
practice has shown that the most efficient learning schedule is to 
update both networks at roughly the same rate. Since the discrimina- 
tor learns from two types of errors, while the generator learns from 
only one, we double the number of learning opportunities for the gen- 
erator, allowing them to both learn at about the same pace. 
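

The four steps of a round can be sketched on a toy problem where each “bill” is a single number. Everything here is an illustrative assumption: the one-weight networks D and G, the learning rate, and the gradient updates (binary cross-entropy for the discriminator, and minus the log of the discriminator's output for the generator), since the text has not yet said how the learning is carried out. The rule that a network learns only from its own mistakes follows the description above.

```python
import math
import random

def sigmoid(t):
    """Logistic function, written to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

class D:
    """Toy discriminator: p_real(x) = sigmoid(w*x + c)."""
    def __init__(self, w=0.0, c=-1.0):
        self.w, self.c = w, c

    def p_real(self, x):
        return sigmoid(self.w * x + self.c)

    def learn(self, x, is_real, lr=0.1):
        # Cross-entropy gradient with respect to the logit is (p - label)
        g = self.p_real(x) - (1.0 if is_real else 0.0)
        self.w -= lr * g * x
        self.c -= lr * g

class G:
    """Toy generator: sample = a*z + b."""
    def __init__(self, a=1.0, b=0.0):
        self.a, self.b = a, b

    def make(self, z):
        return self.a * z + self.b

    def learn(self, z, d, lr=0.1):
        # Descend -log(d.p_real(sample)); the gradient passes back through d
        g = (d.p_real(self.make(z)) - 1.0) * d.w
        self.a -= lr * g * z
        self.b -= lr * g

def learning_round(d, g, real_samples, rng):
    """Run steps (a) to (d); return which network learned at each step."""
    log = []
    x = rng.choice(real_samples)            # (a) look for a false negative
    if d.p_real(x) < 0.5:
        d.learn(x, is_real=True)
        log.append("D")
    else:
        log.append(None)
    for step in ("b", "c", "d"):
        z = rng.gauss(0.0, 1.0)
        fake = g.make(z)
        caught = d.p_real(fake) < 0.5
        if step == "c" and not caught:      # (c) false positive: D learns
            d.learn(fake, is_real=False)
            log.append("D")
        elif step != "c" and caught:        # (b), (d) true negative: G learns
            g.learn(z, d)
            log.append("G")
        else:
            log.append(None)
    return log

rng = random.Random(0)
real_samples = [rng.gauss(4.0, 0.5) for _ in range(200)]
d, g = D(), G()
logs = [learning_round(d, g, real_samples, rng) for _ in range(200)]
print(logs[:3])
```

Each round gives the discriminator two chances to learn, in steps (a) and (c), and the generator two chances, in steps (b) and (d), which is the balance the schedule is designed to achieve.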


To summarize, this process accomplishes three jobs. First, the dis- 
criminator learns to identify features that characterize a real sample. 
Second, the discriminator learns to identify features that reveal a fake 
sample. Third, the generator learns how to avoid including the fea- 
tures that the discriminator has learned to spot. We haven't said how 
the learning is carried out yet, but we'll get to that soon. 


So the discriminator gets better and better at identifying real bills and 
spotting the errors in the counterfeits, and the generator in turn gets 
better and better at finding out how to create a counterfeit that cannot 
be spotted. This pair of networks, taken together, make up a single 
GAN.


We can picture the two networks in a “learning battle” [Geitgey16]. As
the discriminator gets better and better at spotting fakes, the gener- 
ator must get correspondingly better to get one through, causing the 
discriminator to get better at finding the forgery, causing the genera- 
tor to get even better at making fakes, and so on. 


The ultimate goal is to have a discriminator that is essentially as good 
as it can be, with deep and broad knowledge of every aspect of the real 
data, and a generator that can still get forgeries past the discrimina- 
tor. That tells us that the counterfeits are now different from the real 
examples, yet statistically indistinguishable from them, which was our 
goal all along. 


27.3 Why Adversarial?


The name “Generative Adversarial Network” may seem strange in
light of the description above. The two networks seem to be cooperative,
not adversarial.


The word “adversarial” comes from looking at the situation in a slightly
different way. Instead of the cooperation we described between Dawn 
and Glenn, we can imagine that Dawn is a detective with the police, 
and Glenn is working alone. To make the metaphor work we have to 
also imagine that there’s some way for Glenn to discover which of his 
forged bills were detected (perhaps he finds an accomplice in Dawn’s 
office who will forward this information to him). 


If we picture the forger and the detective as opposed to one another,
then indeed they are adversaries. This was how the subject of GANs
was phrased in the original paper on the subject [Goodfellow14]. The
adversarial view doesn’t change anything about how we set up or
train the networks, but it offers a different way to think about them
[Goodfellow16].


The word “adversarial” in turn came to that paper from a branch of
mathematics called game theory [Watson13]. The field is well named, 
because it shows us how to make a formal and systematic study of 
games. Although the field can describe conventional games like chess 
and baseball, it actually embraces a wide variety of rule-based conflict, 
from international relations to sharing resources in communal living. 
Any time we have two or more parties that are opposed in some way,
and which are following a set of well-described rules, we have a can- 
didate for game theory. Professional poker players, economists, and 
political scientists all use game theory. 


We can look at the antagonistic conflict between the detective and 
forger as a kind of game. If the forger gets away with passing his money 
as real, he wins. If the detective catches the forgeries, she wins. Viewed 
this way, we can use the mathematical tools of game theory to get a 
deeper insight into how and why GANs work [Goodfellow14]. 


An interesting thing to note about games like our forger-detective con- 
flict is that the two players are sharing the game and influencing each 
other, but each player is responsible for their own decisions. In a prac- 
tical sense, when we train the discriminator with fake samples, only 
the discriminator learns during that step, despite the fact that we rely 
on the generator to produce the fake data. On the other hand, to train 
the generator we rely on the discriminator’s decisions, so that it can 
learn how it got caught by the discriminator and avoid repeating that 
error. 


27.4 Implementing GANs 


We'll build a GAN out of multiple models. In this context the word 
“model” refers to both a learning architecture (in this case, a set of layers),
and the weights that the system has learned from training.


We typically build a GAN by making one model for the generator, one
model for the discriminator, and then a third model for the two of 
them hooked together. This third model uses the very same layers as 
in the two other models. That is, it doesn’t create new layers with the 
same form, but it uses the very same layers. This means that when we 
update the weights in the third model, those changes are automatically 
part of the other two models, and vice-versa. 
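
The idea of sharing "the very same layers" can be sketched in a few lines of plain Python. This is only an illustration: the Layer class and the names generator, discriminator, and combined are invented here and don't come from any particular library.

```python
# Hypothetical stand-in for a layer: just a named bag of weights.
class Layer:
    def __init__(self, name, weights):
        self.name = name
        self.weights = list(weights)

# Each model is simply a list of layer objects.
generator = [Layer("g_dense1", [0.1, 0.2]), Layer("g_dense2", [0.3])]
discriminator = [Layer("d_dense1", [0.4, 0.5]), Layer("d_dense2", [0.6])]

# The combined model reuses the very same objects; nothing is copied.
combined = generator + discriminator

# Changing a weight through the combined model...
combined[0].weights[0] = 9.9

# ...is immediately visible through the generator, because both lists
# refer to one and the same Layer object.
print(generator[0].weights[0])
print(combined[0] is generator[0])
```

Because the combined list holds references rather than copies, there is never a question of which model has the most recent weights.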


Let’s see this in action, and then look at some results. 


27.4.1 The Discriminator 


The discriminator is the simplest of the three models, as shown in 
Figure 27.8. It takes a sample as input, and its output is a single value 
that reports the network’s confidence that the input is from the train- 
ing set, rather than being an attempted forgery. 


Figure 27.8: The block diagram of a discriminator. (Diagram: a sample goes
into the discriminator, which outputs its confidence that the sample is real.)


There aren’t any other restrictions on how we make the discriminator. 
It can be shallow or deep, and use any kinds of layers: fully connected 
layers of neurons, convolutional layers, recurrent layers, etc. 


In our forging example, the input would be an image of a bill, and the 
output a real number reflecting the network’s decision. A value of 1 
means it’s a real bill, and a value of 0 means it’s a fake. A value of 0.5 
means that the discriminator just can’t tell either way.


27.4.2 The Generator


The generator takes as input a bunch of random numbers. If we build 
our generator to be deterministic, then the same input will always pro- 
duce the same output. In that sense we can think of the input values 
as latent variables. But here the latent variables weren’t discovered by 
analyzing the input, as they were for the VAE in Chapter 24. Instead, 
they just represent a version of the space, or cloud, that contains our 
sample set. The generator uses these values to create a sample corre- 
sponding to that point. 


The output of the generator is a synthetic sample. The block diagram 
is in Figure 27.9. 


Figure 27.9: The block diagram for a generator. (Diagram: noise goes into
the generator, which outputs a sample.)


In our example of forging currency, the output would be an image. 


Though we're referring to the inputs as “noise” or “random values,” 
that’s only because of how we generate them. In fact, these values 
are a perfectly sensible description of what comes out of the gener- 
ator (allowing for the randomness that’s inside the generator itself). 
So we could say that the values going into the generator describe “a
picture of a pencil,” but we usually don’t know which specific
instance of this type of thing is going to come out until we run the
values through the generator. Once we have the output, then we can
say, for example, “These numbers represent a picture of an unsharpened,
yellow #2 pencil with a slightly-used eraser, lying on its side.”
But before we see the output, we don’t know that, so we call the values 
“random” or “noise.” 


The loss function for the generator of Figure 27.9 all by itself is irrel- 
evant, and in some implementations we might leave it off completely. 
As we'll see in the next section, we train the generator by hooking it up
to the discriminator, so the generator will learn from the loss function
applied to that entire network. 


As with the discriminator, there aren’t any other constraints on how 
we build the generator. We can use any kinds of layers we like. 


Once our GAN is fully trained, we often discard the discriminator and 
keep the generator. After all, the discriminator’s purpose was to train 
the generator so that we could use it to make new data. 


When the generator has been disconnected from the discriminator, we 
can use the generator to make an unlimited amount of new data for us 
to use any way we like. 


27.4.3 Training the GAN 


Let’s now look at how to train our GAN. We'll expand the four steps in 
the learning round shown in Figure 27.7, to show where the updates 
get applied. 


Our first step is to look for false negatives, so we feed real bills to the 
discriminator, as in Figure 27.10. In this step, we don’t involve the 
generator at all. The error function is designed to punish the discrimi- 
nator if it reports a real bill as a fake. If that happens, the error drives a 
backpropagation step through the discriminator, updating its weights, 
so that it will get better at recognizing real bills.


Figure 27.10: In the false negative step, the discriminator is hooked up to
an error function that punishes it for categorizing a real bill as a fake. If 
that happens, we use backprop to improve the discriminator so it’s less 
likely to make this error again. 


The second step looks for true negatives. In this step, we use a model 
that starts with random numbers going into the generator, as shown in 
Figure 27.11. The generator’s output is a fake bill, which is then fed to 
the discriminator. The error function is designed to have a large value 
if this fake bill is correctly identified as fake, meaning that the generator
got caught making a forgery.


Figure 27.11: In the true negative step, random numbers feed the generator,
which produces a fake bill. That’s then fed to the discriminator. If
the discriminator catches it, the error is sent back through the network.
The discriminator is not updated because it is “frozen,” meaning that its
weights cannot be affected. The error signal flows down to the generator, 
which is then updated with backprop as usual. 


In Figure 27.11 we’ve grayed-out the update step for the discriminator. 
That’s because the discriminator categorized this bill properly, so we 
don’t want to change its weights. We say that we freeze the network, 
which just means that we don’t update the weights. We still apply 
backprop, though, because we want to push the gradient information 
through the discriminator down to the generator. We then update the 
generator, so it can better learn to fool the discriminator. 


Now we look for false positives. We generate a fake bill and punish the 
discriminator if it classifies it as real, as in Figure 27.12.


Figure 27.12: In the false positive step, we give the discriminator a fake
bill. If it classifies it as real, then we update the discriminator to better 
spot the fakes. 


Finally, we repeat the true negative step of Figure 27.11, so that both 
the discriminator and generator have 2 opportunities to get updated 
in each round of training. 
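
The four steps of a round can be written down as a small bookkeeping sketch that records which network learns at each step. The function name learning_round and its arguments are invented for this illustration. Note that in practice the error is a continuous value and some update is applied at every step; the true/false outcomes here simply mirror the if-then wording of the description above.

```python
def learning_round(real_is_called_fake, fake_is_caught):
    """Return the list of networks updated in one four-step round.

    real_is_called_fake: outcome of step (a); True means a false negative.
    fake_is_caught: three booleans, the discriminator's verdicts on the
        fresh fakes used in steps (b), (c), and (d).
    """
    updates = []
    # (a) False negative: a real bill was labeled fake.
    if real_is_called_fake:
        updates.append("discriminator")
    # (b) True negative: the fake was caught, so the generator learns.
    if fake_is_caught[0]:
        updates.append("generator")
    # (c) False positive: the fake got through, so the discriminator learns.
    if not fake_is_caught[1]:
        updates.append("discriminator")
    # (d) Repeat of (b) with a new fake.
    if fake_is_caught[2]:
        updates.append("generator")
    return updates

print(learning_round(True, [True, False, True]))
# ['discriminator', 'generator', 'discriminator', 'generator']
```

In the busiest possible round, each network gets exactly two chances to learn, which is the balance the schedule is designed to achieve.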


27.4.4 Playing the Game 


An interesting way to think about GANs is to use the game theory view
of two battling opponents. 


Some games involve competing for unbounded resources. For exam- 
ple, in a game of poker, the pot can theoretically get larger and larger 
without limit. 


In other games, the players compete for a fixed and limited pool of 
resources. For example, in a map-based game, there are only so many 
territories that can be occupied. So as players compete for resources, 
claiming them and trading them back and forth, each player’s total 
number of resources can change, but the total number of resources
that are available will not. This is called a zero-sum game, because
each time a player gains a resource (and adds one to their total), the 
other player loses it (and subtracts one from their total), for an overall 
net change of zero [Chen16]. 


In zero-sum games each player can try to set things up so that the other 
player’s best move is of as little advantage as possible. This is called 
a minimax, or minmax, technique [Myers02]. For example, let’s
imagine a board game where two players are trying to build up terri- 
tory. On a particular turn, one player realizes that each of her available 
moves gets her about one piece of territory. But depending on which 
one she selects, her opponent can then make a move that gets 5, 10, 
or 20 pieces of territory. She wants to minimize the negative impact 
on her (that is, the maximum gain of her opponent). Anticipating her 
opponent will always make the best move available, she makes the 
move that leaves the board in a state where her opponent’s best move 
gains only 5 units. 
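
The player's reasoning can be sketched as a tiny calculation. The move names and the lists of possible gains below are made up for illustration; only the worst-case values of 5, 10, and 20 come from the example above.

```python
# Hypothetical board position: for each of her moves, the territory her
# opponent could gain with each of his possible replies.
opponent_gains = {
    "move A": [2, 5, 3],
    "move B": [10, 4, 1],
    "move C": [20, 6, 2],
}

def minimax_choice(replies_by_move):
    """Pick the move whose worst case (the opponent's best reply) is smallest."""
    return min(replies_by_move, key=lambda move: max(replies_by_move[move]))

best = minimax_choice(opponent_gains)
print(best, max(opponent_gains[best]))   # move A 5
```

The inner max assumes the opponent always plays his best reply, and the outer min picks the move that makes that best reply as small as possible.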


Our goal in training the GAN is to produce two networks that are each 
as good as they can be. In other words, we don’t end up with a “win- 
ner.” Instead, both networks have reached their peak ability given the 
other network’s abilities to thwart it. Game theorists call this state a 
Nash equilibrium, where each network is at its best configuration 
with respect to the other [Goodfellow16]. 


27.5 GANs in Action


Let’s build a GAN system and train it. We'll pick something very sim- 
ple so that we can draw meaningful illustrations of the process in 2D. 


Let’s picture all the samples in our training set as a cloud of points in 
some abstract space. After all, each sample is ultimately a list of num- 
bers, and we can treat those as coordinates in a space that has as many 
dimensions as there are numbers.


Our set of “real” samples will be points that belong to a cloud that has
a 2D Gaussian shape. Recall from Chapter 2 that a Gaussian curve has 
a big bump in the center, so we'll expect most of our points to be near 
the bump, with fewer and fewer points as we move outwards. Each 
sample will be a single point from that distribution. For fun, let’s cen- 
ter the 2D blob at (5,5), and give it a standard deviation of 1. Figure 
27.13 shows this distribution. 





Figure 27.13: Our starting distribution is a Gaussian bump centered at
(5,5) with a standard deviation of 1. If we select points at random from this
distribution, about 39% of them (1 − e^(−1/2) ≈ 0.39) will fall into a circle of radius 1 around
the point (5,5). Left: The blob in 3D. Right: A circle showing the location 
of one standard deviation of the blob in 2D, and some representative 
random points drawn from this distribution. 
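
A cloud like this is easy to produce. The following is a minimal sketch using only Python's standard library; the function name sample_real_points is invented here, and we're assuming that "a standard deviation of 1" means each coordinate is drawn independently with that spread.

```python
import random

def sample_real_points(count, center=(5.0, 5.0), std=1.0, seed=None):
    """Draw `count` (x, y) points from a round 2D Gaussian blob."""
    rng = random.Random(seed)
    return [(rng.gauss(center[0], std), rng.gauss(center[1], std))
            for _ in range(count)]

points = sample_real_points(10_000, seed=0)
mean_x = sum(x for x, _ in points) / len(points)
mean_y = sum(y for _, y in points) / len(points)
print(round(mean_x, 2), round(mean_y, 2))   # both should land close to 5
```

Each call to the function gives us as many "real" samples as we ask for, which is all a training set needs to be for this experiment.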


With this interpretation, the generator is trying to learn how to turn 
the random numbers that it’s given into points that seem to belong to 
the cloud. The goal is to do that so well that the discriminator can’t tell 
real points from synthetic ones created by the generator. 


In other words, we want the generator to take in random numbers, and 
output points that could have been the result of picking random points 
from our original Gaussian cloud centered at (5,5).


Given only a single point, as in Figure 27.14, it’s a challenge for the
discriminator to say with any certainty if it’s an original sample 
drawn from our Gaussian cloud, or a synthetic sample created by the 
generator. 


Figure 27.14: We have a single sample and we need to determine if it was 
drawn from the Gaussian distribution. This is a tough decision to make 
with so little information. 


We can make things easier on the discriminator by using an old friend 
from Chapter 8: the mini-batch (or often just the batch). 


Rather than run one sample at a time through the system, we'll run 
through a lot of them, often a power of two in the range 32 to 128. Given 
a whole bunch of points, it’s easier to decide if they were plucked from 
our Gaussian cloud or not. Figure 27.15 shows a few sets of points that 
the generator might produce. It’s easy to tell that these points are very 
unlikely to have been drawn from our original distribution.


Figure 27.15: Some sets of points that are unlikely to have been the result
of picking random values from our starting Gaussian. 


We’d like our generator to produce points more like the right of Figure 
27.13 than any of these. And we'd like the discriminator to classify 
these sets of points as fakes, since they’re so unlikely to have been part 
of the original Gaussian data. 
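
To get a feel for why a whole batch is easier to judge than a single point, here is a small sketch that summarizes two batches of 32 points. The "fake" batch, a tight clump near (1,1), is made up for illustration. The discriminator in this chapter is a neural network and doesn't compute these summaries explicitly; the sketch only shows how much more a batch reveals than one point does.

```python
import random
import statistics

rng = random.Random(1)

def batch_summary(points):
    """Mean and standard deviation of the x and y coordinates of a batch."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (statistics.mean(xs), statistics.mean(ys),
            statistics.stdev(xs), statistics.stdev(ys))

# A batch of 32 real points from the Gaussian at (5,5) with std 1.
real_batch = [(rng.gauss(5, 1), rng.gauss(5, 1)) for _ in range(32)]
# A made-up batch a poor generator might emit: a tight clump near (1,1).
fake_batch = [(rng.gauss(1, 0.2), rng.gauss(1, 0.2)) for _ in range(32)]

print([round(v, 2) for v in batch_summary(real_batch)])
print([round(v, 2) for v in batch_summary(fake_batch)])
```

A single point near (4,4) could plausibly belong to either kind of cloud, but the center and spread of 32 points together leave little doubt.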


Let’s build discriminator and generator networks for this problem. 
Because our original distribution (the 2D Gaussian cloud) is so simple, 
our networks can be simple also. 


A word of warning before we dig into the mechanics, though. GANs are 
known to be very finicky and sensitive. They are notoriously hard to 
train [Achlioptas17]. Minor changes in the architecture of the genera- 
tor or discriminator, or even small changes to some of the parameters 
(such as learning rates or dropout rates) can turn a practically useless 
GAN into a star performer, and vice-versa. Worse, we have to train not 
one network but two, and get them to work together, so the number 
of choices of parameters to search through and fine-tune can become 
overwhelming [Bojanowski17]. So while we develop a GAN, it’s essen- 
tial to experiment using the specific data we want to learn from. 


In the following discussion, we'll skip the many dead-ends and bad- 
ly-performing models that we tried. Instead, we'll jump right to models 
that we found worked well for this dataset. It’s very possible that the
architectures we'll show can be significantly improved (that is, they
could learn faster and more accurately) with further changes, or per- 
haps even just small tweaks in the right places. 


Let’s start with the generator, shown in Figure 27.16. 


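Figure 27.16: A simple generator. It takes in 4 random numbers and
computes an (x,y) pair. (Diagram: 4 inputs feed a fully-connected layer of
16 neurons with leaky ReLU 0.1, followed by a fully-connected layer of 2
neurons with no activation function.)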


The model takes in 4 random values, uniformly selected from the
range 0 to 1. We start with a fully-connected layer with 16 neurons
and a leaky ReLU activation (recall from Chapter 17 that a leaky ReLU,
shown in Figure 27.17, is like a normal ReLU, but instead of returning
0 for negative values it scales them by a small number, in this case 0.1).


Figure 27.17: A leaky ReLU that scales negative points by 0.1.


This is followed by another fully-connected layer with just 2 neurons
and no activation function. And that’s it for the generator. The 2 val- 
ues that are produced are the X and Y coordinates of a point. 
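
To make the structure concrete, here is a minimal sketch of this generator's forward pass in plain Python. The helper names (dense, random_layer) and the weight initialization are placeholders chosen for this illustration, not the setup used to produce the results in this chapter, and the network here is untrained.

```python
import random

def leaky_relu(value, slope=0.1):
    """Pass positive values through; scale negative values by `slope`."""
    return value if value > 0 else slope * value

def dense(inputs, weights, biases, activation=None):
    """Fully-connected layer: one weighted sum (plus bias) per neuron."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(total) if activation else total)
    return outputs

def random_layer(rng, n_inputs, n_neurons):
    """Untrained weights and zero biases (placeholder initialization)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
               for _ in range(n_neurons)]
    return weights, [0.0] * n_neurons

rng = random.Random(0)
w1, b1 = random_layer(rng, 4, 16)    # 4 noise values -> 16 neurons
w2, b2 = random_layer(rng, 16, 2)    # 16 values -> (x, y)

def generator(noise):
    hidden = dense(noise, w1, b1, leaky_relu)
    return dense(hidden, w2, b2)     # no activation on the output layer

noise = [rng.uniform(0, 1) for _ in range(4)]
point = generator(noise)
print(point)                         # an untrained (x, y) guess
```

Because there's no randomness inside this version of the generator, feeding it the same 4 noise values always produces the same point.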


We're asking quite a lot of these two layers with only 18 neurons total. 
We want them to learn how to convert a set of 4 uniformly-distributed 
random numbers into a 2D point that could have been drawn from 
our Gaussian cloud with a center at (5,5) and a standard deviation of 
1, but we'll never tell it the center or size of that cloud. We'll only tell 
it when a mini-batch of its points isn’t a credible match to that cloud, 
and leave it to the neurons to figure out where they went wrong and 
how to make it right. 


We usually want the discriminator to be more powerful than the gen- 
erator, because it needs to learn not just the real distribution, but how 
to spot fakes. Our discriminator is in Figure 27.18. 


Figure 27.18: A simple discriminator. It contains two fully-connected
layers of 16 neurons with leaky ReLU activation functions. The final layer
is a fully-connected layer with 1 neuron and a sigmoid activation function.


This is just two layers of the same form as the start of the generator: 
a pair of fully-connected layers of 16 neurons with a leaky ReLU acti- 
vation. At the end is a fully-connected layer with just 1 neuron and a 
sigmoid activation function. The output is a single number with the 
network’s confidence that the input is from the same dataset as the 
training data. 
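
The discriminator's forward pass can be sketched the same way. As before, the helper names and the random starting weights are placeholders for illustration only, so the confidence this untrained network reports is meaningless.

```python
import math
import random

def leaky_relu(value, slope=0.1):
    return value if value > 0 else slope * value

def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))

def dense(inputs, weights, biases, activation):
    return [activation(sum(w * x for w, x in zip(row, inputs)) + bias)
            for row, bias in zip(weights, biases)]

def random_layer(rng, n_inputs, n_neurons):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
               for _ in range(n_neurons)]
    return weights, [0.0] * n_neurons

rng = random.Random(0)
layers = [
    random_layer(rng, 2, 16) + (leaky_relu,),   # (x, y) -> 16
    random_layer(rng, 16, 16) + (leaky_relu,),  # 16 -> 16
    random_layer(rng, 16, 1) + (sigmoid,),      # 16 -> 1 confidence
]

def discriminator(point):
    values = list(point)
    for weights, biases, activation in layers:
        values = dense(values, weights, biases, activation)
    return values[0]

confidence = discriminator((5.0, 5.0))
print(confidence)   # somewhere between 0 and 1; meaningless until trained
```

The sigmoid on the last layer is what keeps the output between 0 and 1, so we can read it as a confidence.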


Finally, we put the generator and discriminator together to make a
third model, which is sometimes referred to as the generator-discrim- 
inator, or sometimes simply the GAN itself. Figure 27.19 shows this 
combination. 


Figure 27.19: Putting the generator and discriminator together gives us
the full GAN. (Diagram: 4 noise values enter the generator, whose output
feeds the discriminator.)


Since the generator presents an (x,y) pair at its output, and the discriminator
takes an (x,y) pair at its input, the two networks go together
perfectly. The input is a set of 4 random numbers, and the output tells
us how likely it is that the point created by the generator is from the
training set’s distribution.


It’s important to keep in mind that the models marked “generator” and
“discriminator” are not copies of our earlier models, but they are in fact
the very same models, just connected together one after the other to 
make one big model. In other words, there’s just one generator model 
and one discriminator model. When we make the combined model of 
Figure 27.19 we just chain together those two existing models. Modern 
deep-learning libraries let us make multiple models out of shared com- 
ponents for just this kind of application. 


Using the same models in these different configurations makes sense, 
since the combined model needs to use the most up-to-date versions 
of the generator and discriminator.


Another important point is that when we train the generator using
the combined model of Figure 27.19, we don’t want to train the dis- 
criminator as well. We saw this in Figure 27.11 where we grayed-out 
the discriminator during the update step. We need to run backprop 
through the discriminator, since it’s part of the network, but we only 
apply the update step to the weights in the generator. 


Remember that we want to train the discriminator and generator in 
alternating passes. If we were to apply backprop to the entire network 
of Figure 27.19, then we’d update the weights in the discriminator as 
well as the generator. Because we want to train both models at about 
the same rate, and we know we're going to train the discriminator sep- 
arately (since it also needs to be trained on real data), we want to tell 
our library to update the weights in the generator only. 


The mechanics for controlling whether or not a layer should have its 
weights updated are library specific, but generally speaking we’re able 
to freeze, lock, or disable updates on each layer. Then we can 
unfreeze, unlock, or enable updates later when we want those 
layers to be able to learn. As we discussed earlier, the backprop algo- 
rithm still runs through these layers, because we need to compute the 
gradients down into the generator. We just don’t apply the update step 
to the discriminator. 
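
Here is a toy sketch of what freezing means, using networks so small that each has a single weight. Everything in it is made up for illustration: the one-weight networks, the starting values, the learning rate, and the choice of -log(p) as the generator's error (it's large when the discriminator is confident the sample is fake). The point is only that gradients are computed through both networks, while the update is applied to just one.

```python
import math

def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))

# Toy one-weight networks (made up for this sketch).
w_g = 0.5     # generator:      x = w_g * z
w_d = 2.0     # discriminator:  p = sigmoid(w_d * x)
discriminator_trainable = False   # "frozen" while the generator learns
learning_rate = 0.1

z = 1.0
x = w_g * z                    # forward through the generator
p = sigmoid(w_d * x)           # forward through the discriminator
loss = -math.log(p)            # large when the fake is judged fake

# Backprop runs through BOTH networks.
grad_a = p - 1.0               # d loss / d (w_d * x)
grad_w_d = grad_a * x          # gradient for the discriminator weight
grad_x = grad_a * w_d          # gradient passed down to the generator
grad_w_g = grad_x * z          # gradient for the generator weight

# The update step is applied only where training is enabled.
w_g -= learning_rate * grad_w_g
if discriminator_trainable:
    w_d -= learning_rate * grad_w_d

new_p = sigmoid(w_d * w_g * z)
print(p, new_p)                # the fake now looks a little more "real"
```

Notice that grad_x, the signal the generator needs, could not be computed without passing through the discriminator's weight, even though that weight never changes.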


To summarize the training process, we start with a mini-batch of points 
from the training set. We then follow the 4-stage process in Figure 27.7, 
training the discriminator and generator alternately. 


Let’s look at some results. 


To train our GAN, we first made a training set by picking 10,000 ran- 
dom points from our starting Gaussian blob. Then we trained the 
networks using mini-batches of 32 points. Running all 10,000 points 
through the system made up one epoch. 
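
As a quick check of the arithmetic, we can count how many mini-batches make up one epoch. Whether the final, smaller batch is used or dropped depends on the training setup, which isn't specified here, so the sketch reports both numbers.

```python
import math

training_points = 10_000
batch_size = 32

full_batches, leftover = divmod(training_points, batch_size)
batches_per_epoch = math.ceil(training_points / batch_size)

print(full_batches, leftover, batches_per_epoch)   # 312 16 313
```

So each epoch consists of 312 full mini-batches, plus one partial batch of 16 points if we choose to keep it.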


Results for 15 epochs are shown in Figure 27.20.


Figure 27.20: Our simple GAN in action. Read the plots left to right, top
to bottom. Our starting Gaussian is shown with blue points, and a blue
circle showing its mean and standard deviation. The distribution that is
being learned by the GAN is shown in orange, with an ellipse showing
the center and standard deviation of the mini-batch of points that were
generated. The plots show the results after 0 to 10 epochs of training,
and then epoch 13.


We can see that the GAN’s generated points start out as a smudgy line 
in the southwest-northeast direction, roughly centered around (1,1). 
With each iteration, they move closer to the original data’s center
and shape. Around epoch 4 the generated samples overshoot the cen- 
ter, and become increasingly elliptical rather than circular. But they 
come back and correct both qualities, until the match is looking very 
good by epoch 13.


Figure 27.21 shows the loss curves for the discriminator and generator.
Ideally they would both reach a value of 0.5 and stay there. We can see 
that our very simple models did a good job of getting close to that goal. 
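
One possible reading of these curves depends on which loss is being plotted. The text doesn't name the loss used for this small GAN, so the following is an assumption: if it is binary cross-entropy (the loss named in the caption of Figure 27.22 for the DCGAN), then a discriminator that reports 0.5 for every sample produces a loss of ln 2, or about 0.69, rather than 0.5 itself.

```python
import math

def binary_crossentropy(label, prediction):
    """Loss for one sample: label is 1 (real) or 0 (fake)."""
    return -(label * math.log(prediction)
             + (1 - label) * math.log(1 - prediction))

# A discriminator that "just can't tell" outputs 0.5 for everything.
print(binary_crossentropy(1, 0.5))   # 0.6931..., which is ln 2
print(binary_crossentropy(0, 0.5))   # the same value
```

Under that assumption, loss curves that level off somewhat above 0.5 are what we'd expect from a discriminator whose output has settled near 0.5.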


Figure 27.21: The loss for our GAN. They seem to meet and remain at a
value a little above the ideal of 0.5. (Plot legend: generator loss,
discriminator loss.)


27.6 DCGANs 


We said that we could build our discriminator and generator using any 
kind of architecture we like. Our simple models made of dense layers 
performed nicely for our little 2D dataset, but if we want to work with 
images then we’d probably prefer to use convolutional layers, since as 
we saw in Chapter 21, they’re well suited to processing images. A GAN 
that’s built from convolution layers has its own acronym, DCGAN, 
standing for Deep Convolutional Generative Adversarial 
Network. 


Let’s use a DCGAN on the MNIST data we’ve seen in previous chap- 
ters. We'll use a model proposed by [Gildenblat16]. The generator and 
discriminator are shown in Figure 27.22.


Figure 27.22: Top: The generator of a DCGAN for MNIST. Bottom: The
discriminator. Every processing layer uses the tanh activation function, 
with the exception of the sigmoid on the last layer for the discriminator. 
Both the discriminator and the combined generator-discriminator are 
trained with a standard binary crossentropy loss function and a Nesterov 
SGD optimizer set to a learning rate of 0.0005 and momentum of 0.9 
(model from [Gildenblat16]). 


In this network we’re using explicit downsampling (or pooling) layers 
in the discriminator, and upsampling (or expanding) layers in the generator,
because that’s how the network was originally proposed. We'll
talk about rolling these operations into their associated convolution
layers below.


The second dense layer in the generator uses 6272 neurons. This num- 
ber comes about because experimentation showed that giving the first 
convolution layer a tensor with 128 channels worked well. We can see 
that the generator has a pair of 2 by 2 upsampling layers, so since we 
want an output of 28 by 28, the input to the first upsampling layer 
should be 7 by 7. So the input to this layer needs 7 x 7 x 128 = 6272 
numbers. We simply give the second dense layer this many neurons, 
and we reshape the output into a 3D tensor before going into the first 
upsampling layer. 
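
The arithmetic behind the 6272 neurons can be traced by working backwards from the output size. The variable names are just labels for the quantities mentioned above.

```python
output_size = 28          # MNIST images are 28 by 28
upsampling_layers = 2     # each one doubles width and height
channels = 128

size = output_size
for _ in range(upsampling_layers):
    size //= 2            # undo one 2-by-2 upsampling step

dense_neurons = size * size * channels
print(size, dense_neurons)   # 7 6272
```

Reading the loop in the other direction gives the generator's path: 7 by 7 becomes 14 by 14 after the first upsampling layer and 28 by 28 after the second.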


We saw in Chapter 20 that batchnorm layers are designed to sit
between a layer’s output and its activation function, so that’s how we 
set up the batchnorm in the generator. 


The discriminator follows roughly the same process as the generator, 
but in reverse. We have a couple of convolution layers, each followed 
by 2 by 2 max pooling. We reshape the output into a list by using a 
flatten layer (we could also use a reshaping layer). Then we have a cou- 
ple of dense layers, the second with a single output. 


The results of the generator after 1 epoch of training are pretty unintelligible,
as we might expect. Figure 27.23 shows what they look like.


Figure 27.23: The blotches from the generator after 1 epoch of training.


After 100 epochs of training, the generator produced the results of
Figure 27.24.


Figure 27.24: The output of the deep convolutional GAN of Figure 27.22
after 100 epochs of training on the MNIST dataset.


When we step back to consider the process, this is a startling result. 
Remember that the generator has never seen the dataset. It has no 
idea what the MNIST data looks like. All it’s ever done is create ran- 
dom 28 by 28 grids of numbers, and then receive feedback that told it 
how good or bad the values in those grids were. Over time, it produced 
grids that look like digits. There are some misfires, but most of the digits
are easily recognizable.


27.6.1 Rules of Thumb


We mentioned earlier that GANs are very sensitive to their specific 
architecture and training variables. A famous paper investigated GANs 
based on convolutional layers, and found a few rules of thumb that 
seem to lead to good results [Radford16]. 


First, don’t use pooling layers in the discriminator to reduce the size 
of the data. This echoes the advice of [Springenberg15] that we saw in 
Chapter 21. Instead of using pooling after convolution, use a convolution
layer with striding. For instance, to reduce the size of the input’s
width and height by a half, use a stride of 2 in each dimension.
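
The effect of striding on size can be checked with the standard output-size formula for a convolution. In this sketch the kernel size of 3 and padding of 1 are illustrative choices of ours; only the stride of 2 comes from the advice above.

```python
def conv_output_size(size, kernel, stride, padding):
    """Width (or height) after a convolution: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# Illustrative kernel and padding (3 and 1); only the stride of 2 is from the text.
print(conv_output_size(28, kernel=3, stride=2, padding=1))   # 14
print(conv_output_size(14, kernel=3, stride=2, padding=1))   # 7
```

Two such layers take a 28 by 28 image down to 7 by 7 without any pooling layers.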


This advice applies as well to the generator when enlarging the input 
samples of noise up to an entire image. Rather than using an upsam- 
pling layer that repeats the data (perhaps with interpolation) to make 
the shape larger, use transposed convolution (or fractional striding) to 
achieve the same effect. For example, to enlarge the data by a factor of 
2, we'd use a stride of 2 in each dimension in a transposed-convolution
layer. 
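
The matching formula for a transposed convolution shows the doubling. Again, the kernel size of 4 and padding of 1 are illustrative choices, and the formula ignores extras such as output padding and dilation that some libraries offer.

```python
def transposed_conv_output_size(size, kernel, stride, padding):
    """Width (or height) after a transposed convolution: (n - 1) * s - 2p + k."""
    return (size - 1) * stride - 2 * padding + kernel

print(transposed_conv_output_size(7, kernel=4, stride=2, padding=1))    # 14
print(transposed_conv_output_size(14, kernel=4, stride=2, padding=1))   # 28
```

With these settings, two transposed-convolution layers grow a 7 by 7 tensor into a 28 by 28 image, with no separate upsampling layers.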


Also echoing [Springenberg15], use fully-connected layers only as the 
first or last layers in the network. 


Both the generator and discriminator should be trained with the Adam 
optimizer. Typical starting learning rates are 0.001 (1e-3) for the gen- 
erator, and 0.0001 (1e-4) for the discriminator. 


Next, apply batchnorm layers after each convolution in both networks, 
except the final convolution in the generator and the first convolu- 
tion in the discriminator. Remember that batchnorm should appear 
between the convolution layer’s output and its activation function. 


Finally, we should use specific activation functions in the two networks.
In the generator, use ReLU everywhere but at the final layer, where we
should use tanh. In the discriminator, use a leaky ReLU for all layers.


These points are guidance, not rigid rules. They’re not all going to
work for every network we design and every dataset we learn from. But 
experience has shown that they’re a good place to start. 


It’s interesting to note that our models of Figure 27.22 violate many of 
these rules. These networks had probably already gone through a pro- 
cess of experimentation and tuning [Gildenblat16]. We experimented 
with applying the rules of thumb above to these networks in a straight- 
forward way, and performance dropped considerably. 


One reason might be that the expanding layers in the generator come 
before the convolution layers, while transposed convolution layers 
place the upsampling after the convolution is done. Often this change 
doesn’t make a big difference, but in these networks it seemed to mat- 
ter a lot. 


When we have a tuned network already at our disposal, it makes sense 
to use it in the form that works. But when we’re making our own net- 
works from scratch, it’s best to follow these guidelines. 


27.7 Challenges 


Perhaps the biggest challenge to using GANs in practice is their sensitivity
to both structure and parameters. Playing a game of cat and
mouse requires both parties to be closely matched at all times. If either 
the discriminator or generator gets better than the other too quickly, 
the other will never be able to catch up. As we mentioned above, get- 
ting the right combination of all of these values is essential to getting 
good performance out of a GAN, but finding that combination can 
be challenging [Arjovsky17a] [Achlioptas17]. Following the rules of 
thumb given above is generally reeommended when we're building a 
new DCGAN. 


A theoretical issue with GANs is that we currently have no proof that 
they will converge. Recall our lone perceptron of Chapter 10, which 
finds the dividing line between two linearly-separable sets of data. We 
can prove that the perceptron will, given enough training time, always 
find that dividing line. But when things get complicated, up to and 
including GANs, such proofs are nowhere to be found. All we can say 
is that GANs do seem to settle into good performance most of the time 
when we find the right parameters, but there’s no guarantee beyond 
that. 


27.7.1 Using Big Samples 


The basic structure of a GAN can run into trouble when we try to train 
a generator to produce large images, such as 1000 by 1000 pixels. The 
computational problem is that with all that data, it’s easy for the dis- 
criminator to tell the generated fakes from the real images. Trying to 
fix all these pixels simultaneously can lead to error gradients that cause 
the generator’s output to move in almost random directions, rather 
than getting closer to matching the inputs [Karras17]. On top of that, 
there’s the practical problem of finding enough compute power, mem- 
ory, and time to process large numbers of these big samples. Recall 
that every pixel is a feature, so every image that’s 1000 pixels on a side 
has 1 million features (or 3 million if it’s a color photo). 


Because we want our final, high resolution images to stand up to scru- 
tiny, we’re going to want to use a large training set. The time required 
to crunch through big collections of giant images is going to add up 
fast. Even fast hardware might not be able to do the job in the time we 
have available. 


A way out is to start by resizing the images in the training set into a 
variety of smaller sizes, for example 512 pixels on a side, then 128, then 
64, and so on, down to 4 pixels on a side. Then build a small generator 
and discriminator, each with just a few layers of convolution. Train 
these small networks with the 4 by 4 images. When they are doing a 
great job, add a few more convolution layers to the end of each 
network, and now train them with 8 by 8 images. Again, when the results 
are good, add some more convolution layers to the end of each 
network and train them on 16 by 16 images. 


In this way, the generator and discriminator are able to build on their 
training as they grow. This means that by the time we work our way 
up to full size images that are 1024 pixels on a side, we already have a 
GAN that can do a great job at generating and discriminating images 
that are 512 pixels on a side. We won't have to do too much additional 
training with the larger images until the system performs well with 
them, too. This process takes much less time to complete than if we’d 
trained with only the full-sized images from the start [Karras17]. 
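

To get a feeling for why starting small helps, we can tabulate how many features each image size involves. This is a minimal sketch in plain Python. It assumes that the image side doubles at each stage (matching the 4, 8, 16 steps described above) and that the largest images are 1024 pixels on a side; the function names are invented for this illustration.

```python
def resolution_schedule(start=4, final=1024):
    """Image sizes to train on, doubling from start up to final."""
    sizes = []
    size = start
    while size <= final:
        sizes.append(size)
        size *= 2
    return sizes

def feature_count(side, channels=3):
    """Every pixel value in every channel is one feature."""
    return side * side * channels

for side in resolution_schedule():
    print(f"{side:5d} pixels on a side: "
          f"{feature_count(side):>9,d} features per color image")
```

The earliest stages involve only a few dozen features per image, so the networks can learn the coarse structure quickly before they ever see the millions of features in a full-size image.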


27.7.2 Modal Collapse 


Another problem is specific to GANs. Let’s suppose that we’re trying to 
train our GAN to produce pictures of cats. Suppose that the generator 
manages to find one cat image that the discriminator accepts as real. 
A sneaky generator could then just produce that image every time. No 
matter what values we use for the noise inputs, we always get back that 
one image. The discriminator tells us that every image it gets is plausi- 
bly real, so the generator has accomplished its goal and stops learning. 


This is another example of neural networks finding sneaky solutions 
to the things we want them to learn. The generator has accomplished 
exactly what we asked for, since it can turn random numbers into 
brand-new samples that the discriminator cannot tell apart from real 
samples. The problem is that every sample made by the generator is 
identical. It did what we asked for, which wasn’t quite what we wanted. 


This problem of producing just one successful output over and over 
is called modal collapse (note that the first word is “modal,” pro- 
nounced “mode’-ull”, referring to a mode, or a way of working, and not 
“model”). If the generator settles into just a single sample (in this case, 
a single picture of a cat), the situation is described as full modal col- 
lapse (or the Helvetica scenario [Popper12]). Much more common 
is when the system produces the same few outputs, or minor variations 
of them. This situation is called partial modal collapse. 


Figure 27.25 shows one run of our DCGAN after 3 epochs of training 
using some poorly-chosen parameters. It’s pretty clear that the system 
is collapsing towards a mode where it’s going to output some kind of 1 
a lot more than anything else. 


Figure 27.25: After only 3 epochs of training, this DCGAN is showing clear 
signs of modal collapse. 


There are schemes for addressing this problem, but currently the best 
recommendation begins with using mini-batches of data, as we did 
above. Then the discriminator’s loss function can be extended with 
some additional terms to measure the diversity of the outputs produced 
in that mini-batch. If the outputs fall into a few groups where 
they’re all the same, or nearly the same, the discriminator can assign 
a larger error to the result. The generator will diversify because that 
action will reduce the error [Arjovsky17b]. 
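

One simple way to picture such a diversity term is to measure how far apart the samples in a mini-batch are from one another. The sketch below is an illustration only, not the method of any particular paper: the choice of average pairwise distance, and the names mean_pairwise_distance and diversity_penalty, are assumptions made here. The penalty grows as the samples in the batch become more alike.

```python
from itertools import combinations

def mean_pairwise_distance(batch):
    """Average L1 distance between every pair of samples in a mini-batch.

    Each sample is a flat list of numbers (for example, pixel values).
    """
    pairs = list(combinations(batch, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        total += sum(abs(x - y) for x, y in zip(a, b))
    return total / len(pairs)

def diversity_penalty(batch, scale=1.0, eps=1e-8):
    """Hypothetical extra loss term: large when the samples are nearly alike."""
    return scale / (mean_pairwise_distance(batch) + eps)

# Two made-up mini-batches of three tiny "images" each.
collapsed = [[0.9, 0.1, 0.9], [0.9, 0.1, 0.9], [0.9, 0.1, 0.8]]
varied = [[0.9, 0.1, 0.9], [0.1, 0.8, 0.2], [0.5, 0.5, 0.0]]

print("collapsed batch penalty:", diversity_penalty(collapsed))
print("varied batch penalty:   ", diversity_penalty(varied))
```

Adding a term like this to the error means that a batch of near-identical outputs costs more than a varied batch, which gives the generator a reason to spread out.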


Creative Applications 


Let’s have some fun with our deep learners! We'll use our systems 
to create some cool images, and even generate some more text for 
this book. 


28.1 Why This Chapter Is Here 


We've reached the end of a long book, so let’s relax and have some fun. 
In this chapter we'll look at some creative ways to use neural networks 
to create art. 


28.2 Visualizing Filters 


In Chapter 21 we made images, or visualizations, of the filters in a con- 
volutional neural network. 


We're going to use variations on that method for two techniques in 
this chapter. To prepare for that, let’s review the process, but in a little 
more detail than before. Then we'll be in good shape to modify it to 
make art. 


28.2.1 Picking A Network 


To visualize a filter, we need to know which network it comes from. 
For most of the projects in this chapter, we'll use the VGG16 network 
[Simonyan14], though we could substitute just about any trained CNN. 
As we discussed in Chapter 21, VGG16 is made up of 5 blocks of convo- 
lution layers, with downsampling between blocks. Figure 28.1 recaps 
the schematic for VGG16. 


[Figure 28.1 diagram: the input, then five convolution blocks, each 
followed by 2x2 pooling with stride 2x2. Block 1: 64 filters (3x3, ReLU), 
2 times. Block 2: 128 filters (3x3, ReLU), 2 times. Block 3: 256 filters 
(3x3, ReLU), 3 times. Block 4: 512 filters (3x3, ReLU), 3 times. Block 5: 
512 filters (3x3, ReLU), 3 times. Then 4096 (ReLU), dropout 0.5, 
4096 (ReLU), dropout 0.5, and 1000 (softmax).] 


Figure 28.1: The VGG16 network. Each block is repeated either 2 or 3 
times in a row, as marked. 


To simplify our diagrams going forward, we won’t draw the zero-padding 
layers before each convolution layer. We'll also drop all the labels 
that are consistent across the network. Those include the ReLU 
activation functions on the convolution layers, the 3 by 3 size of the 
filters, and the 2 by 2 size and stride of the max-pooling layers. 
Finally, we'll drop everything after block 5. That’s because we're not 
training this network, and we won't care about its final predictions. 
We'll just want to feed in an input and see how the filters in the 
convolution layers react. Our simplified diagram of VGG16 is shown in 
Figure 28.2. Note that we’re not omitting anything from the actual 
network, we’re merely simplifying our diagrams. 





Figure 28.2: A simplified diagram of the full VGG16 architecture in Figure 
28.1. In this drawing, we've left off the zero-padding layers, the activation 
functions and filter sizes, the labels on the pooling steps, and everything 
after the end of block 5. 


Chapter 28: Creative Applications 


VGG16 was trained on the ImageNet ILSVRC-2014 database 
[ImageNet14]. That database contained about 1.3 million training 
photographs, manually labeled into 1000 categories. The photos included 
lots of animals, as well as everyday objects. 


The trained VGG16 network, with all of its weights, is widely 
available online. Easy access to VGG16 is built into many machine learning 
libraries, as we saw in Chapter 24. 


28.2.2 Visualizing One Filter 


Let’s pick out one filter from VGG16 to visualize. We'll call this the 
“target filter.” 


To get started, we'll create an image that’s filled with random noise. 
We'll use values from -1 to 1. So each of the three color entries for 
every pixel is initially assigned a random number, usually drawn from 
a uniform distribution (as discussed in Chapter 2). Let’s call that our 
“noise image.” Figure 28.3 shows an example. 


Figure 28.3: At the far left, a color image made of uniform noise. The 
pixel values have been moved and scaled to the range [0, 255] for display. 
The images to its right show the three color channels separately. The 
image is 224 by 224, as expected by VGG16. 


We'll now run the noise image through our convnet, as in Figure 28.4. 


Figure 28.4: Running a noise image into VGG16. The output of each 
convolution layer is a tensor with 1 slice per filter on that layer. 


The output of each convolution layer is a 3D tensor. The width and 
height are the same as the width and height of that layer’s input, and 
there is one channel for each filter in the layer. Recall from Chapter 
21 that the output of each filter is called its activation map, so each 
channel in the output tensor is the activation map of one filter. 


In Figure 28.4, the input is 224 by 224, with 3 channels. So the output 
of the first convolution layer, which has 64 filters, is 224 by 224 with 
64 channels. The output of the second convolution layer has the same 
shape. 


The output of the second layer in block 1 goes into a max-pooling step 
which reduces the input’s width and height by a factor of 2. Thus the 
input to the first convolution layer in the second block has the shape 
112 by 112 by 64. Since that layer has 128 filters, its output is 112 by 112 
by 128. And so it goes until the final layer, which produces an output 
of 14 by 14 with 512 channels. 
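

The shapes described above can be traced with a few lines of arithmetic. This is a sketch in plain Python that only does bookkeeping; it does not run a network. The layer labels such as block1_conv1 are names chosen here for readability, and the filter counts and repeat counts come from Figure 28.1.

```python
# (number of filters, number of convolution layers) for each VGG16 block.
VGG16_BLOCKS = [(64, 2), (128, 2), (256, 3), (512, 3), (512, 3)]

def conv_output_shapes(height=224, width=224):
    """Return (label, height, width, channels) for every convolution layer."""
    shapes = []
    for block, (filters, repeats) in enumerate(VGG16_BLOCKS, start=1):
        for layer in range(1, repeats + 1):
            # Zero-padding preserves width and height; the number of
            # output channels equals the number of filters.
            shapes.append((f"block{block}_conv{layer}", height, width, filters))
        # 2 by 2 max pooling with stride 2 halves the width and height.
        height //= 2
        width //= 2
    return shapes

for label, h, w, c in conv_output_shapes():
    print(f"{label}: {h} by {w} by {c}")
```

The first line printed is the 224 by 224 by 64 output of the first convolution layer, and the last is the 14 by 14 by 512 output of the final convolution layer in block 5.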


To visualize the target filter, we’ll ignore all of these outputs except for 
the single activation map corresponding to the filter we want. Let’s say 
we want to look at the activation map of filter number 17 in the first 
convolution layer of the second block, as in Figure 28.5. 


Figure 28.5: We run our noisy input into VGG16, and then look at a single 
filter somewhere in the network. We get the activation map for that filter 
and add all of the values there together. We use that as the loss function 
during backpropagation (shown as a dashed line) to modify the inputs to 
stimulate that filter even more. 


To get a feeling for how strongly this filter is responding to the input, 
we'll just add up all the values in the activation map to come up with a 
single number. In actual code, we usually multiply each value by itself 
before we add them all together. This step has a number of benefits, 
such as making sure that our values are always adding up positive 
numbers. There are implementation details like this associated with 
many of the algorithms we'll see in this chapter, but since these details 
don’t promote our understanding of the algorithms themselves, we’ll 
usually leave them out. 
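

As a small illustration of that implementation detail, here is a sketch in plain Python of both ways of collapsing an activation map into one number. The function name stimulation and the tiny map are invented for this example. The map deliberately includes a negative value to show how squaring keeps every contribution positive.

```python
def stimulation(activation_map, squared=True):
    """Collapse one filter's activation map (a 2D list) into a single number."""
    total = 0.0
    for row in activation_map:
        for value in row:
            total += value * value if squared else value
    return total

# A made-up 2 by 3 activation map.
activation_map = [[0.0, 0.5, -0.5],
                  [1.0, 0.0, 0.0]]

print("plain sum:     ", stimulation(activation_map, squared=False))
print("sum of squares:", stimulation(activation_map))
```

In the plain sum the 0.5 and -0.5 cancel each other, while in the sum of squares both of them add to the total.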


Let’s return to seeing how strongly the filter is responding by adding 
up its activation map values. The more places in the input where the 
filter found a good match, the larger the activation map values will be, 
and thus the larger their sum will be. We say that this sum tells us 
how strongly the filter was stimulated. We want this number to be as 
large as possible. 


We normally compute a network’s error, or loss, by looking at the output 
of the last layer and finding how far it is from the right answer. We 
then try to minimize that value. But in this case, we’re deriving our 
measurement from one of the internal layers, and our goal is to make 
that value as large as possible. To avoid inventing new terms we retain 
our usual language and refer to our measured value (coming out of the 
plus sign in Figure 28.5) as the error or loss, though those names don’t 
quite fit our intention to make that value as large as possible. 


When we run backpropagation in this situation, we don’t update the 
network. We perform backprop, but we skip the update step, so the 
weights are not changed. It’s essential to keep in mind that we don’t 
change any weights. After all, we’re not interested in teaching the net- 
work anything at this point. Instead, we'll push the error gradient all 
the way through until it reaches the input layer. This gives us a gra- 
dient for every pixel, telling us how to adjust it to make the loss we’re 
measuring larger or smaller. 


We want to make the loss larger, so the filter is even more stimulated, 
because we want to create an image that stimulates the filter as much 
as possible. So instead of following the gradient “downhill” to produce 
a smaller loss value, we follow it “uphill” to produce a larger loss value. 
In other words, we adjust each pixel according to its gradient to stimu- 
late the filter even more. We call this gradient ascent. 
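

Gradient ascent on the input can be demonstrated with a toy stand-in for a convnet. The sketch below, in plain Python, is not VGG16: it assumes a single fixed filter sliding over a one-dimensional row of pixels, so that the gradient can be written out by hand instead of being found by backprop. The filter weights, the step size, and the choice to clip pixels to the range -1 to 1 are all assumptions made for this illustration.

```python
def activation_map(pixels, weights):
    """Slide the filter over a 1D row of pixels (no padding)."""
    n = len(pixels) - len(weights) + 1
    return [sum(w * pixels[i + k] for k, w in enumerate(weights))
            for i in range(n)]

def loss(pixels, weights):
    """Sum of squared activations: how strongly the filter is stimulated."""
    return sum(a * a for a in activation_map(pixels, weights))

def pixel_gradient(pixels, weights):
    """Gradient of the loss with respect to each pixel (not the weights)."""
    acts = activation_map(pixels, weights)
    grad = [0.0] * len(pixels)
    for i, a in enumerate(acts):
        for k, w in enumerate(weights):
            grad[i + k] += 2.0 * a * w
    return grad

def ascend(pixels, weights, step=0.05, iterations=20):
    """Follow the gradient uphill, changing only the pixels."""
    pixels = list(pixels)
    for _ in range(iterations):
        grad = pixel_gradient(pixels, weights)
        # Move each pixel uphill, then keep it in the range [-1, 1].
        pixels = [min(1.0, max(-1.0, p + step * g))
                  for p, g in zip(pixels, grad)]
    return pixels

weights = [-1.0, 2.0, -1.0]                  # a fixed, made-up filter
noise = [0.2, -0.1, 0.4, 0.0, -0.3, 0.1]     # our "noise image"
result = ascend(noise, weights)

print("loss before:", loss(noise, weights))
print("loss after: ", loss(result, weights))
```

Note that the weights are only ever read, never written. Swapping the plus sign in the update for a minus sign would turn this into ordinary gradient descent.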


28.2.3 Visualizing One Layer 


We can generalize this method to visualize how much we're stimulat- 
ing all of the filters on a given layer. 


We'll just look at each activation map coming out of that layer, add 
up its values, and then add up the values from all the maps. In other 
words, we just add together all the values in the tensor coming out of 
the layer. That gives us one number that represents how strongly the 
filters on that layer, taken together, are responding to the input. 


Figure 28.6 shows the idea visually. 


Figure 28.6: Rather than add up the response of just a single filter, we 
can add up the responses of all the filters in a given layer, and use that as 
our loss, encouraging the input to better stimulate this layer as a whole. 


Running the loss of the whole layer back through the network with 
backprop will cause the pixels in our noisy image to change so that 
they stimulate the filters on that layer even more. 


Let’s try it out. Figure 28.7 shows the results for each of the 
convolution layers in VGG16. 


Figure 28.7: Visualizing entire layers. These images are the result of 
running Figure 28.6 for each layer in VGG16. Since they started with 
random noise, each time we generate these images we'll get a different, 
but similar, result. 


The second and third filters in block 4 seem to show a lot of concentric 
circles, pieces of spirals, and what look like pieces of tubes made out of 
partial circles. We'll see these structures again soon. 


A variation on the algorithm we just discussed would allow us to say 
that some filters are more important to use than others. Rather than 
just adding all the filter activations together, we could scale each fil- 
ter’s output by some value before adding it into the total. Each pixel 
would be pushed most strongly towards the filters with the greatest 
scaling factors. For now, when we want to find the activation of a layer, 
we'll stick with adding together all of its filters without that scaling, so 
they’re all equally important. 
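

Both the plain layer sum and the scaled variation fit in one small function. This is a sketch in plain Python, where a layer's output is represented as a list of 2D activation maps, one per filter. The name layer_stimulation and the example numbers are invented for this illustration.

```python
def layer_stimulation(layer_output, filter_scales=None):
    """Add up every value in a layer's output.

    layer_output is a list of activation maps, one 2D list per filter.
    filter_scales optionally gives one scaling factor per filter.
    """
    if filter_scales is None:
        # Without scaling, all filters are equally important.
        filter_scales = [1.0] * len(layer_output)
    if len(filter_scales) != len(layer_output):
        raise ValueError("need one scaling factor per filter")
    total = 0.0
    for scale, amap in zip(filter_scales, layer_output):
        total += scale * sum(sum(row) for row in amap)
    return total

# A made-up layer with two filters, each producing a 2 by 2 map.
layer_output = [
    [[1.0, 0.0], [0.0, 1.0]],   # filter 0
    [[0.5, 0.5], [0.5, 0.5]],   # filter 1
]

print("unscaled:       ", layer_stimulation(layer_output))
print("favor filter 1: ", layer_stimulation(layer_output, [1.0, 3.0]))
```

With a larger scaling factor on filter 1, its activations contribute more to the total, so the pixel gradients will push harder towards whatever that filter responds to.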


28.3 Deep Dreaming 


Now that we can adjust our pixels to better stimulate one layer, let’s 
adjust them to stimulate multiple layers. This lets us create some wild 
looking images. 


To make these images, we'll make two changes to Figure 28.6. 


First, we'll add up the results from multiple layers, not just one. We'll 
scale each layer’s output by an associated scaling factor, and then add 
those scaled responses together. These scaling factors are hyperpa- 
rameters that we set before we start making an image. 


Our second change to Figure 28.6 is to replace our noisy input with an 
image of our choice. Our running example will be the frog in Figure 
28.8. 


Figure 28.8: A photograph of The Frog. 


To find the multi-layer loss for this frog, we need only pick some 
layers and weights and run the process we just covered. Figure 28.9 
shows the idea for three layers. 


Figure 28.9: The Deep Dream algorithm uses a loss built from multiple 
layers. We find the activation maps for all the filters on each of a chosen 
set of layers, add up their values, weight those by values of our choosing, 
add those weighted sums together, and that’s our loss. The image will 
be modified to try to stimulate all of our chosen filters, with priority to 
those with the largest weights. 


The original name for this technique was inceptionism 
[Mordvintsev15], but it has come to be more frequently known as 
deep dreaming. The name is a poetic suggestion that the convnet is 
“dreaming” about the original image, and the image we get back shows 
us where the network’s dream went. 


As with the noisy image that we used for filter and layer visualiza- 
tion, any clumps of pixels in our starting image that happen to cause a 
response in the layers we picked out to use will be adjusted to increase 
that response. So the pixels gradually change value to stimulate the 
filters on our selected layers more and more strongly. The resulting 
modifications to the starting image often have a psychedelic look. 
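

The multi-layer loss that drives these changes can be sketched in a few lines of plain Python. The layer outputs here are tiny made-up tensors standing in for what the convnet would produce, and the names layer_a, layer_b, and layer_c, along with the function names, are placeholders invented for this illustration.

```python
def tensor_sum(layer_output):
    """Add up every value in a layer's output (a list of 2D activation maps)."""
    return sum(value for amap in layer_output for row in amap for value in row)

def deep_dream_loss(layer_outputs, layer_weights):
    """Weighted sum of per-layer totals.

    layer_outputs maps a layer name to that layer's output tensor.
    layer_weights maps each chosen layer name to its scaling factor.
    """
    return sum(weight * tensor_sum(layer_outputs[name])
               for name, weight in layer_weights.items())

# Made-up outputs for three layers; real ones would come from the convnet.
layer_outputs = {
    "layer_a": [[[1.0, 2.0]], [[0.0, 1.0]]],
    "layer_b": [[[0.5]], [[0.5]], [[1.0]]],
    "layer_c": [[[3.0]]],
}

# The scaling factors are hyperparameters. Here layer_b is not selected.
layer_weights = {"layer_a": 0.5, "layer_c": 2.0}

print(deep_dream_loss(layer_outputs, layer_weights))
```

Changing the entries in layer_weights is the main creative control: different selections and weights send the image's "dream" in different directions.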


Figure 28.10 shows some “dreams” from our frog image. 


Figure 28.10: Some deep dream results from the starting frog image 
(in the upper-left corner). These images are entirely algorithmic, and the 
result of applying the algorithm in Figure 28.9 using VGG16 and different 
combinations of layers and weights. 


Sometimes our dreams enhance the original image in a clear way. 
Figure 28.11 shows the results from an image of a dog. 


Figure 28.11: Deep dreaming, starting from the dog in the upper left. 


Sometimes the results can be surrealistic, as in Figure 28.12. 


Figure 28.12: Deep dreaming, starting with the cat in the upper left. 


If we crank up the weights, or let the system run a long time, we can 
get some extreme results. Figure 28.13 shows some examples. 





Figure 28.13: Deep dreaming, starting with the cat in the upper left of 
Figure 28.12. These images aren’t much like the original cat, but they have 
their own personalities. 


Many variations on the basic algorithm described above have been 
explored [Tyka15], but the surface has only been scratched. We can 
imagine schemes to automatically determine the weights on the lay- 
ers, or even apply different weights to the individual filters on each 
layer. We can “mask” the activation maps before we add them up, so 
that some areas (like the background) are ignored, or we can mask the 
updates to the pixels so that some pixels in the original image are not 
changed at all in response to one set of layer outputs, but are allowed 
to change a lot in response to some other set of layer outputs. We could 
even apply different combinations of layers and weights to different 
regions of the input image. 
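

The two kinds of masking mentioned above can be sketched as follows. This is a simplified illustration in plain Python with invented function names. It assumes each mask has already been resized to match the grid it is applied to, since in a real convnet the activation maps of deeper layers are smaller than the image.

```python
def masked_sum(activation_map, mask):
    """Add up an activation map, weighting each value by its mask entry.

    mask has the same shape as activation_map: 0 ignores a location
    (such as the background) and 1 keeps it.
    """
    return sum(a * m
               for a_row, m_row in zip(activation_map, mask)
               for a, m in zip(a_row, m_row))

def masked_update(pixels, gradient, update_mask, step=0.1):
    """Take a gradient ascent step only where update_mask is nonzero."""
    return [[p + step * g * m for p, g, m in zip(p_row, g_row, m_row)]
            for p_row, g_row, m_row in zip(pixels, gradient, update_mask)]

activation = [[1.0, 2.0],
              [3.0, 4.0]]
keep_diagonal = [[1, 0],
                 [0, 1]]
print(masked_sum(activation, keep_diagonal))

pixels = [[0.0, 0.0],
          [0.0, 0.0]]
gradient = [[1.0, 1.0],
            [1.0, 1.0]]
only_top_left = [[1, 0],
                 [0, 0]]
print(masked_update(pixels, gradient, only_top_left))
```

Mask values between 0 and 1 would let some regions change a little and others a lot, rather than switching them fully on or off.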


There’s no “right” or “best” way to do deep dreaming. It’s a creative 
exercise where we follow our aesthetics, hunches, or wild guesses to 
hunt for images that appeal to us. It can be hard to predict what’s going 
to come out from any particular combination of layers and weights, so 
the process rewards patience and a lot of experimenting. In this sense, 
it’s a lot like looking for pretty fractals [Beddard11] [Ragets15]. 


One result of using pre-trained networks is that we see echoes of 
their training data in our dreamed images. For example, the circles 
and tubes from Figure 28.7 are easy to see in our dreaming results, as 
are complete eyes (presumably because so many of VGG16’s training 
images were animals with eyes). If we train a new network with pic- 
tures of, say, office supplies, then we should expect to see fragments of 
staplers and tape dispensers in our enhanced images. 


The deep dreaming approach to making art has lots of room left for 
new discoveries. 


The code we used to create the generated images in this section was 
adapted from [Bonaccorso17]. 


28.4 Neural Style Transfer 


We can use filter and layer responses in another way to do something 
remarkable: transfer one artist’s style of painting onto another image. 
This process is called neural style transfer. 


Cultures often celebrate the idiosyncratic visual style of artists. Let’s 
focus on paintings. What characterizes the style of a painting? 


That’s a big question, because “style” can include someone’s world 
view, which influences choices as diverse as their subject matter, com- 
position, materials, and tools. Let’s focus strictly on visual appearance. 
Even narrowed down this way, it’s hard to precisely identify what “style” 
means for a painting, but we might say that it refers to how colors and 
shapes are used to create forms, and the types and distributions of 
those forms across the canvas [ArtStory17] [Wikipedia17]. 


Rather than try to refine this description, let’s see if we can find some- 
thing that seems like it’s in the ballpark, while also being something 
we can formalize in terms of the layers and filters of a deep convolu- 
tional network. 


Our goal in this section is to take a picture we’d like to modify, called 
the base image, and a second picture whose style we’d like to match, 
called the style reference. For example, our frog could be our base 
image, and any painting could be the style reference. We'll use these to 
create a new image, called the generated image, which has the con- 
tent of the base image, expressed in the style of the style reference. 


To get started, we'll make an assertion that may sound crazy. We'll say 
that we can characterize the style of a painting by looking at the layer 
activations it produces. 


This idea is due to a seminal paper published in 2015 [Gatys15]. What 
makes it work is that we don’t just use the activation maps as they come 
out of the convolution layers, but instead we process them in a partic- 
ular way, producing a 2D table from each output. It’s those tables that 
hold the representation of the style. 


Because these tables are so critical to the algorithm, let’s see how they 
get made. 


28.4.1 Capturing Style in a Matrix 


Let’s imagine a convolution layer that takes in an input that is 8 by 8. 
We'll say the layer has 6 filters in all, so its output tensor will be 8 by 8 
by 6. 


We'll build a single 2D grid, or table, to represent what’s happening in 
this layer. 


To build our table, we consider each pair of activation maps (that is, 
each pair of channels in the output tensor). First we'll look at maps 0 
and 1, then 0 and 2, and so on, then 1 and 2, then 1 and 3, all the way 
up to maps 4 and 5. 


We'll take each pair of maps and multiply them together, element by 
element, and then add up the result. The number that results from 
this step goes into the position in our table with coordinates given by 
the filter numbers. The table we build up this way is called a Gram 
matrix. 


Let’s see an example. Figure 28.14 shows the outputs of filter 1 and fil- 
ter 2, and the result of multiplying them together. Adding up all those 
multiplied values gave us 10.90, so that’s the value that went into the 
table at location (1,2). Because the order of the maps doesn’t matter in 
this operation, the same result would come from Filter 2 and Filter 1, 
so we put the same value we just computed at entry (2,1) as well. 


[Figure 28.14 panels: Filter 1 map (0.00-1.00); Filter 2 map (0.02-1.00); Product (0.00-0.90); Gram[1,2] = 10.90] 


Figure 28.14: The two images on the left show the responses of imaginary 
filters 1 and 2 in a convolution layer with 6 filters in all. In this diagram, 
and the ones to come, the responses are all positive. The image to their 
right is the result of multiplying corresponding elements together. Adding 
those results together gives us 10.90. That number is placed in the Gram 
matrix at (1,2) and (2,1). Because each grid in this diagram has a different 
range of values (shown at the top of each grid), each of the left three 
grids in this figure (and those to follow) has been independently scaled 
for display from blue for 0 to purple for 1. 


In Figure 28.14, the filter activation maps are 8 by 8, matching the 
width and height of the layer’s input. The table we produce at the far 
right is 6 by 6, because there are 6 filters in this layer. 


The responses of filters 1 and 2 overlapped a lot in Figure 28.14, so we 
got a large sum for their multiplied pixels. Let’s compare the activation 
map of filter 1 with the map from another imaginary filter, number 3. 
This time they don’t overlap much, so the sum of their multiplied acti- 
vations is only 1.36, as shown in Figure 28.15. 


[Figure 28.15 panels: Filter 1 map (0.00-1.00); Filter 3 map (0.00-1.00); Product (0.00-0.16); Gram[1,3] = 1.36] 


Figure 28.15: Finding the Gram matrix entries for the activation maps of 
filters 1 and 3. Filter 1 has the same response as before. These responses 
overlap much less than 1 and 2, so their summed value is much smaller. 
That value goes into (1,3) and (3,1). 


Let’s take one more pair. Filter 4 has found 3 different matches in 
the input, while filter 5 found a few in the bottom right. Figure 28.16 
repeats the steps above for these two filters. 


[Figure 28.16 panels: Filter 4 map (0.00-1.00); Filter 5 map (0.02-1.00); Product (0.00-0.61); Gram[4,5] = 8.61] 


Figure 28.16: Finding the Gram matrix entries for the responses of filters 
4 and 5. Where both activation maps have large values, the corresponding 
value in the third grid is also large. The sum of the values in that third 
grid is saved at (4,5) and (5,4). 





These Gram matrices are the 2D tables we referred to earlier which 
hold the style of the image that generated our filter responses. 
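

The recipe for building a Gram matrix is short enough to write out in full. The following is a minimal sketch in plain Python, with a function name and three tiny made-up activation maps chosen for this illustration. It also fills in the diagonal by pairing each map with itself, which is part of the usual definition of a Gram matrix.

```python
def gram_matrix(activation_maps):
    """Build the Gram matrix for a list of equally sized 2D activation maps."""
    # Flatten each map into one long list so pairs are easy to compare.
    flat = [[v for row in amap for v in row] for amap in activation_maps]
    n = len(flat)
    gram = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Multiply element by element, then add up the results.
            value = sum(a * b for a, b in zip(flat[i], flat[j]))
            gram[i][j] = value
            gram[j][i] = value   # the order of the two maps doesn't matter
    return gram

maps = [
    [[1.0, 0.0], [0.0, 1.0]],   # filter 0
    [[1.0, 0.0], [0.0, 0.5]],   # filter 1: overlaps filter 0 a lot
    [[0.0, 1.0], [0.0, 0.0]],   # filter 2: no overlap with filter 0
]

for row in gram_matrix(maps):
    print(row)
```

The size of the result depends only on the number of filters, not on the width and height of the activation maps, so a layer with 6 filters always gives a 6 by 6 table.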


Why should this be? Why does this recipe for creating 2D tables from 
activation maps have anything to do with the style of an image? We'll 
come back to this later. 


28.4.2 The Big Picture 


Let’s recap our plan. 


Our overall goal will be to create a generated image that looks like 
the base image, but has the same style as our style reference. 


We'll do this the same way we visualized filters and layers. We'll start 
with an image full of noise. We use noise because it makes no 
assumptions about what the output should look like. We'll discover that 
as we go. 


The noisy image serves as our first draft of the generated image. We'll 
gradually refine this noisy image using backprop. Each time we run 
the generated image through the network, we'll compute an error, and 
then modify our generated image to become more like what we’re 
hoping for. Thus the noise will gradually transform into a version of our 
base image, but in the style of the style reference. 


In deep dreaming we wanted to maximize the error we measured, 
because we wanted to stimulate the filters on selected layers. 


In style transfer, we want to go back to the more usual procedure and 
minimize the network’s error. That’s because now the loss will tell us 
two things. First, the content loss tells us how much our generated 
image is not like the base image in content. Second, the style loss 
tells us how much our generated image is not like the style in the style 
reference. Our error will be these two values added together. Since 
we want our generated image to match both content and style, we'll 
change the pixel values to minimize this error. 


By using both losses at the same time, the hope is that the pixels in our 
generated image will change in a way that will cause them to look 
more like the base image, while simultaneously having a style that’s 
more like the style reference. 


Let’s look at these two losses. 


28.4.3 Content Loss 


To enable us to compute the content loss, we take a pre-processing step 
before we even begin making our generated image. We just run the 
base image through our network, and save the output tensor coming 
out of each convolution layer. Figure 28.17 shows the idea, continuing 
to use the pre-trained VGG16 network. 


Figure 28.17: Before we start style transfer, we run the base image through 
our network. We save the activation maps for each layer we will want to 
include in the construction process. Here we've shown the outputs of 
three arbitrary layers. 


Let’s now imagine we're doing style transfer. Since we’re just starting, 
our generated image is filled with noise. We'll present that image to 
the network, and collect the activation maps produced at each layer in 
response. 


So now we have two tensors for each layer. First, we have the activa- 
tion maps we saved from when we gave the network the base image. 
Second, we have the new activation maps produced in response to the 
generated image we just gave the network. We'll find the difference 
between every corresponding pair of values in these two tensors (we 
usually multiply each such difference by itself first, so it’s always posi- 
tive, and bigger differences have more impact). We'll add up all those 
differences, and that will be the error for that layer. We'll add together 
this sum from every layer we want to include, and that’s our content 
loss for this input image. Figure 28.18 shows the idea visually. 


Figure 28.18: To compute the content loss, we find the outputs for the 
current image (white), and compare them element by element to the 
saved outputs from the base image (blue). The sum of all of those differ- 
ences is the content loss. In this figure, the rounded rectangle with two 
tensors inside stands for the operation of conceptually finding the differ- 
ences between those two tensors element by element, and summing 
those differences together. 
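

The bookkeeping described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the activations are random stand-in tensors rather than real VGG16 outputs, the layer names are illustrative labels, and the helper names `layer_content_loss` and `content_loss` are ours.

```python
import numpy as np

def layer_content_loss(saved, current):
    """Sum of squared element-by-element differences between two activation tensors."""
    diff = current - saved
    return float(np.sum(diff * diff))

def content_loss(saved_by_layer, current_by_layer, layer_names):
    """Add up the per-layer losses for every layer we chose to include."""
    return sum(layer_content_loss(saved_by_layer[name], current_by_layer[name])
               for name in layer_names)

# Stand-in activations shaped (height, width, filters). Real ones would come
# from running the base image and the generated image through the network.
rng = np.random.default_rng(0)
shapes = {"block1_conv2": (8, 8, 4), "block2_conv2": (4, 4, 8)}
base_acts = {name: rng.normal(size=shape) for name, shape in shapes.items()}
noise_acts = {name: rng.normal(size=shape) for name, shape in shapes.items()}

loss_noise = content_loss(base_acts, noise_acts, list(shapes))
loss_same = content_loss(base_acts, base_acts, list(shapes))
print(loss_noise, loss_same)
```

When the generated image produces exactly the saved activations, every difference is zero and so is the loss. Any mismatch pushes the loss up.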


Let’s get a feeling for these saved activations by looking at them. We'll 
follow the same procedure we used earlier for visualizing a layer, but 
instead of summing up the outputs from multiple layers, we'll follow 
a version of Figure 28.18, where we only find the loss due to a single 
layer. We'll compare the generated image’s responses with the ones 
we saved from the base image. The more different they are, the higher 
the loss, which will ultimately cause the pixels to change to reduce that 
loss. We'll run the process over and over until the results stop changing. 
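

To make the idea of changing the pixels (rather than the weights) concrete, here is a toy sketch. It is not the VGG16 procedure: we assume a hypothetical one-layer "network" made of a fixed random linear map followed by ReLU, a 16-value "image," and a hand-written gradient. The step size and iteration count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for one frozen network layer: fixed weights plus ReLU.
W = rng.normal(size=(32, 16)) / np.sqrt(16)

def layer(x):
    return np.maximum(W @ x, 0.0)

def loss_and_grad(x, saved):
    pre = W @ x
    diff = np.maximum(pre, 0.0) - saved
    loss = float(np.sum(diff * diff))
    # Chain rule back to the input: only units with positive input pass gradient.
    grad = 2.0 * W.T @ (diff * (pre > 0))
    return loss, grad

base_image = rng.normal(size=16)    # plays the role of the base image's pixels
saved = layer(base_image)           # activations saved in the pre-processing step
generated = rng.normal(size=16)     # we start from noise

initial_loss, _ = loss_and_grad(generated, saved)
for step in range(500):
    loss, grad = loss_and_grad(generated, saved)
    generated -= 0.05 * grad        # adjust the pixels, never the weights
final_loss, _ = loss_and_grad(generated, saved)
print(initial_loss, final_loss)
```

The weights `W` never change. Only `generated` moves, and it moves in whatever direction makes its activations look more like the saved ones.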


Using the frog for our base image, the content activations are shown in 
Figure 28.19. 


Figure 28.19: The result of synthesizing a new image from noise, while 
trying to match the activations of different individual layers of VGG16. 


As we'd probably expect, the early activations do a great job of picking 
up details in the input image, so when we try to match them we get 
something a lot like the frog that generated those activations. As we 
get further into the network, the layers are looking for bigger features, 
so our attempt to match each layer’s output looks less and less like the 
frog. 


We'll save our base image’s activations off to the side somewhere, 
to be used later when we compute the content loss during style transfer. 


28.4.4 Style Loss 


The style loss also depends on a pre-processing step taken before style 
transfer. In this step, we save some information resulting from run- 
ning our style reference through our network. 


Our style reference in this section will be a 1907 self-portrait by Pablo 
Picasso, shown in Figure 28.20. 





Figure 28.20: A 1907 self-portrait by Pablo Picasso. This will be our style 
reference in the following figures. 


Like content loss, we'll run our style reference through the network 
and save some values. But rather than save the activations from each 
layer, we save the Gram matrices that we build from those activations. 


Figure 28.21 shows the idea. 


Figure 28.21: We run our style image through the network, and for every 
layer we're going to want to use in the construction later, we build and 
save a Gram matrix. The size of each matrix is given by the number of 
filters in the layer, so the size of the tables grows as we move deeper into 
VGG16. 


To compute the style loss, we'll take a similar approach to computing 
the content loss. We'll run our input through the network, and compare 
the Gram matrices at each layer we’re interested in with the matrices 
we saved from the style reference. Figure 28.22 illustrates the idea. 


Figure 28.22: Computing the style loss. We run our current image through 
the network, and from each layer’s output tensor we compute a Gram 
matrix. Then we find the difference between each matrix entry and the 
corresponding entry we saved in Figure 28.21 from the style reference. 
We add up all those differences to get the style loss. As in Figure 28.18, 
the rounded rectangles tell us to essentially find the element by element 
difference between their matrices, and add them all together. 


This loss will cause the pixels in the input to change so that the result- 
ing activations, after being turned into Gram matrices, more closely 
match the Gram matrices we saved from the style reference. 
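

The style-loss steps above can be sketched as follows. This is a minimal illustration, not the code used for the figures: the activations are random stand-ins, the layer names are illustrative, the differences are squared before summing (as we did for the content loss), and no normalization is applied, though some implementations scale the Gram matrices by the layer size.

```python
import numpy as np

def gram_matrix(activations):
    """Turn (height, width, filters) activations into a (filters, filters) table."""
    h, w, f = activations.shape
    flat = activations.reshape(h * w, f)    # one column per filter
    return flat.T @ flat                    # dot product of every pair of filters

def style_loss(saved_grams, current_acts, layer_names):
    total = 0.0
    for name in layer_names:
        diff = gram_matrix(current_acts[name]) - saved_grams[name]
        total += float(np.sum(diff * diff))
    return total

rng = np.random.default_rng(2)
shapes = {"block1_conv1": (8, 8, 4), "block2_conv1": (4, 4, 8)}
style_acts = {name: rng.normal(size=shape) for name, shape in shapes.items()}
noise_acts = {name: rng.normal(size=shape) for name, shape in shapes.items()}

# Pre-processing step: save one Gram matrix per layer for the style reference.
saved_grams = {name: gram_matrix(acts) for name, acts in style_acts.items()}

print({name: g.shape for name, g in saved_grams.items()})
print(style_loss(saved_grams, noise_acts, list(shapes)))
```

Note that each saved matrix has one row and one column per filter, so its size depends only on the number of filters in the layer, not on the size of the image.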


Let’s run noise through our system, and try to get it to match the Gram 
matrices we saved from a style reference. 


Let’s match just one layer at a time, as we did for the content loss. That 
is, we'll run the process of Figure 28.22, but only measure the loss from one 
layer at a time. Using an input image of noise with the same aspect ratio as 
our frog, we get Figure 28.23. 


Figure 28.23: The result of getting noise to match the Gram matrix for 
each layer of VGG16. 


This is remarkable. 


It shows that the Gram matrices really do seem to capture style information. 
In particular, the layers in block 3 are doing a great job. They’re 
showing splotches of colors bordered by black lines, just like the style 
reference. 


Let’s try a small variation, and instead of computing the loss layer 
by layer, we’ll compute the loss cumulatively. So for each layer, we'll 
compute the style loss as the sum of the losses of all layers up to, and 
including, that layer. Figure 28.24 shows the results. 


Figure 28.24: The result of getting noise to match the Gram matrices in 
VGG16, but in each case we compute the loss as the sum of all the layer 
losses up to that layer. For example, the second layer in block 3 uses the 
sum of both layers in block 1, both layers in block 2, and the first two 
layers in block 3. 


This is even better! By the time we get to block 3, we're generating 
abstract images that have a lot of similarity to our original style reference 
in Figure 28.20. The splotches of color show similar gradual changes in 
color, and there are even what look like rough brushstroke textures. 
The layers in blocks 4 and 5 don’t contribute much additional style 
information (as we saw in Figure 28.23). 


28.4.5 Performing Style Transfer 


We've got all the pieces we need to do style transfer. 


We make some random noise, and feed it to our network. We gather 
up the activations from all the layers we care about. From some layers, 
we compute the content loss. From some layers, we compute the style 
loss. 


We'll add one more step and weight the content and style losses, so we 
can vary which one has the most influence. Then we add those values 
together to make the total loss, run backprop, adjust the pixels, and 
repeat. Gradually, the pixels will change in a way to minimize the two 
losses at the same time, giving us a picture that looks like the base 
image but has the style of the style reference. 


Figure 28.25 summarizes the process visually. 


Figure 28.25: Style Transfer. Starting with a noisy image, we compare 
the filter activations on selected layers with the values we saved for the 
content. Their differences are added together and scaled by how much 
influence we want the content to have on the result. We also build Gram 
matrices from the input, and compare those at our selected layers to the 
saved Gram matrices. We add up all the differences and scale that result 
by the style weight. Adding together the content and style values we get 
the loss, which we use with backpropagation to find out how to adjust the 
input pixels to better match both the content and style simultaneously. 


Let’s look at some results. Figure 28.26 shows 9 different paintings, 
each with a different style. We'll use these as our style images below. 


Figure 28.26: Nine images with different styles, which will serve as our 
style references. From left to right and top down, they are “The Starry 
Night,” by Vincent Van Gogh, “The Shipwreck of the Minotaur,” by J.M.W. 
Turner, “The Scream,” by Edvard Munch, “Seated Female Nude,” by Pablo 
Picasso, “Self-Portrait 1907,” by Pablo Picasso, “Nighthawks,” by Edward 
Hopper, “Sergeant Croce,” by the author, “Water Lilies, Yellow and Lilac,” 
by Claude Monet, and “Composition VII” by Wassily Kandinsky. 


Let’s apply these styles to our old friend the frog. Before sending each 
image into VGG16, we scaled it to the expected size of 224 by 224. 
To show the results here, we scaled each output back to the original 
image’s size. 


Figure 28.27 shows the results. 


Figure 28.27: Applying the nine styles in Figure 28.26 to a photograph of 
a frog (top). 


Wow. That worked great. 


These images bear close examination, because there’s a lot of detail in 
them. At first glance we can see that the color palette of each style 
reference has been transferred to the frog photo. But notice the textures 
and edges, and how blocks of color are shaped. These images 
are not just color-shifted frogs, or some kind of overlay or blend of two 
images. Instead, these are high-quality, detailed images of the frog in 
the different styles. To see this more clearly, Figure 28.28 shows the 
same zoomed-in region from each frog. 


Figure 28.28: The nine styled frogs of Figure 28.27. Up top we mark a 
section around the frog’s front leg. Below we show that section enlarged 
from each style. 


Looking at these left to right, top to bottom, we can see that the frog 
based on “The Starry Night” is composed of many short strokes, each 
loaded with multiple colors. The frog from “The Shipwreck of the Minotaur” 
shows smooth but textured regions. The frog based on “The Scream” is 
drawn with long, flowing strokes, most of which are of a similar color 
to their neighbors. The frog based on “Seated Female Nude” is a bit 
disappointing because it doesn’t seem to replicate the strong edges and 
lines in the original, but it does match the low contrast within regions 
of similar color. The frog from “Self-Portrait 1907” shows the rough 
brushwork in the original. The frog based on “Nighthawks” is rendered 
with blocks of fixed color. The frog based on “Sergeant Croce” uses big 
regions of mostly solid colors, some of which are outlined in black. The 
frog based on “Water Lilies” is drawn with dabs of paint from a muted 
palette, and the frog from “Composition VII” has the kind of vibrantly 
colored, high-contrast shapes that characterize the painting. 


We appear to have successfully transferred style! 


Let’s look at applying these styles to another couple of images. 


In Figure 28.29 we’ve applied our nine styles to a photograph of a land- 
scape with mountains. 





Figure 28.29: The nine styles of Figure 28.26 applied to a landscape 
photograph of mountains (top). 


Finally, in Figure 28.30 we apply our styles to a photograph of a town. 





Figure 28.30: Applying our nine styles of Figure 28.26 to a photograph of 
a town seen from above (top). 


These images are all the more remarkable when we remember that 
every one of them started out as random noise. 


To make these images, we used VGG16. For the content loss, we used 
only the output of the second convolution layer in the first block (we 
chose this rather than the first layer because it seemed to produce a little 
less pixel-level speckling). For the style loss, we used the outputs of 
all the convolution layers in the network. 


We weighted the content loss by 0.025, and the style loss by 1, so the 
style was weighted 40 times more heavily than the content when changing 
the pixels. In these examples, a little bit of content went a long 
way. 


We also used a very small amount of a third loss. This was found by 
adding up the differences between neighboring pixels in the generated 
image. The idea is that most pixels should be close in color to their 
neighbors, so by minimizing this loss we suppress another little bit of 
speckling at the pixel level. The weight for that loss was nearly negligi- 
ble at about 0.0001. 
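

Here is a small sketch of that third loss and of how the three weighted pieces might be combined. The weights are the ones quoted above. Everything else is an assumption made for illustration: we square the neighbor differences before adding them (the text only says they are added up), and the function names are ours.

```python
import numpy as np

def total_variation(image):
    """Add up squared differences between each pixel and its lower and right neighbors."""
    dy = image[1:, :, :] - image[:-1, :, :]
    dx = image[:, 1:, :] - image[:, :-1, :]
    return float(np.sum(dy * dy) + np.sum(dx * dx))

def total_loss(content, style, variation,
               content_weight=0.025, style_weight=1.0, variation_weight=0.0001):
    """Weighted sum of the three losses; this is the number backprop works to reduce."""
    return (content_weight * content
            + style_weight * style
            + variation_weight * variation)

rng = np.random.default_rng(3)
speckled = rng.uniform(size=(16, 16, 3))     # noisy image: neighbors disagree a lot
smooth = np.full((16, 16, 3), 0.5)           # flat gray image: neighbors all agree

tv_speckled = total_variation(speckled)
tv_smooth = total_variation(smooth)
combined = total_loss(content=10.0, style=10.0, variation=10.0)
print(tv_speckled, tv_smooth, combined)
```

With equal raw losses of 10, the style term contributes 10 to the total, the content term 0.25, and the neighbor term only 0.001, which shows how lopsided these weights are.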


All of these parameters were chosen by trial and error. Different 
choices of content and style will probably look their best with different 
parameters. 


The code we used to create the generated images in this section was 
adapted from [Chollet17] and [Majumdar17]. 


28.4.6 Discussion 


As the above figures show, the basic algorithm of neural style transfer 
produces terrific results. The technique has been extended and modi- 
fied in many ways to improve the flexibility of the algorithm, the types 
of results it produces, and the range of control that artists can apply to 
create the results they want [Jing17]. It’s even been applied to video 
and spherical images that completely surround a viewer [Ruder17]. 


The whole enterprise is driven by how we compute the loss. 


The way we measure content loss seems reasonable. If a generated 
image causes the filters on an early network layer to respond in the 
same way that they responded to the content image, then the details 
in the generated image are likely to be similar to those in the content 
image. 


The style loss is a bit more mysterious, though. We promised to return 
to this question before, and now we'll tackle it. Why do those Gram 
matrices do such a fantastic job of capturing our idea of style? 


The anticlimactic answer is that nobody really knows [Li17]. There 
are different ways to write down the mathematics of what the Gram 
matrices are measuring, but that doesn’t help us understand why this 
technique captures this elusive idea we call “style.” Neither the orig- 
inal paper on neural style transfer [Gatys15], nor a somewhat more 
detailed follow-up [Gatys16], explains how the authors hit on this idea 
or why it works so well. 


One way to think about it is that the Gram matrix has a large value for 
pairs of filters that both responded strongly in about the same loca- 
tions in the layer’s input. As we saw in Figure 28.14, Figure 28.15, and 
Figure 28.16, when the activation maps of the two filters being com- 
pared have a lot of overlap, and their values are large in that overlap, 
then we'll get a correspondingly large value in the Gram matrix. So any 
time an entry in the Gram matrix has a large value, we can say that the 
two filters it corresponds to are both reacting strongly to many similar 
locations in the layer’s input. 


That much we know. In hindsight, maybe this characteristic catches 
style because style is the result of consistently using two or more ways 
of structuring the appearance of an image. 


For example, we might say that a painting has a certain style if much of 
the surface contains blocks of mostly solid color that abut one another 
with straight lines. Or maybe the painting has the same blocks of color, 
but there’s usually a black line between them. Maybe we say that a 
painting has a certain style if the brush strokes are smooth and change 
color slowly over their lengths, and are bordered by other strokes of 
the same quality and nearly the same color. 


If, and that’s a big if, these sorts of things are what we have come to 
call “style,” then the Gram matrix approach makes sense. After all, the 
filters detect the qualities of the image (e.g., a block of color, or a flow- 
ing stroke), and the Gram matrices tell us when two or more of our 
filters found what they were looking for in the same places. 


If that’s the case, then perhaps we would also recognize specific styles 
when 3 types of appearance qualities are used, or 4, or more. It could 
be fun to build a 3D table and fill it up with the similarities between 
3 filters, or a 4D table for 4 filters, and so on, and see what kinds of 
results we get. 
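

For anyone who wants to try this, a 3D table is easy to build with NumPy's `einsum`. The sketch below uses random stand-in activations, and the names `gram_2d` and `gram_3d` are ours. It is one plausible way to extend the Gram matrix to triples of filters, not an established technique.

```python
import numpy as np

def gram_2d(flat):
    """flat is (positions, filters). Entry [i, j] sums filter i times filter j."""
    return np.einsum("pi,pj->ij", flat, flat)

def gram_3d(flat):
    """Entry [i, j, k] sums the product of three filters' responses over all positions."""
    return np.einsum("pi,pj,pk->ijk", flat, flat, flat)

rng = np.random.default_rng(4)
acts = rng.uniform(size=(6, 6, 5))    # stand-in activations: a 6x6 map from 5 filters
flat = acts.reshape(-1, 5)            # one row per location, one column per filter

table2 = gram_2d(flat)
table3 = gram_3d(flat)
print(table2.shape, table3.shape)     # (5, 5) (5, 5, 5)
```

A practical warning: the table grows quickly. A layer with 512 filters would need 512 x 512 x 512 entries in its 3D table, or about 134 million numbers.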


As with deep dreaming, neural style transfer is a general algorithm that 
allows for a lot of variation and exploration. There are surely many 
interesting and beautiful artistic effects waiting to be discovered. 


28.5 Generating More of This Book 


Just for fun, we ran the text of this book (except for this section) 
through an RNN that generates new text word by word, as discussed 
in Chapter 22. The full text, including the code listings and figure cap- 
tions, but not the references, contains about 427,000 words, drawn 
from a vocabulary of about 10,300 words. To learn this text, we used a 
network built from two layers of LSTMs, with 128 cells each. 


The algorithm develops its output by finding the next most likely word 
given the text it’s created so far, then the next most likely word, then 
the next, and so on, until we stop it. Generating text this way is like 
the game of creating messages by choosing from the words suggested 
by a cell phone, based on the words we've used so far [Lowensohn14]. 
Of course, that’s no coincidence, since most phones are probably using 
an algorithm like this to choose the words they offer. The only difference 
is that we’re picking the words automatically. 


Once we get past the novelty, reading big blocks of text generated word 
by word isn’t that interesting, because the prose has no meaning. 
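

The word-by-word loop described above can be sketched without any neural network at all. In this toy version, a table of word-pair counts stands in for the trained RNN. That is a big simplification, because the table looks only at the single most recent word, while the RNN takes into account the text generated so far. The corpus and function names are made up for the example.

```python
from collections import Counter, defaultdict

def train_bigrams(words):
    """Count which word follows which. This table stands in for the trained RNN."""
    follows = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows

def generate(follows, seed, length):
    """Repeatedly append the most likely next word, given the last word so far."""
    output = [seed]
    for _ in range(length):
        candidates = follows.get(output[-1])
        if not candidates:
            break                     # nothing ever followed this word, so stop
        output.append(candidates.most_common(1)[0][0])
    return output

corpus = "we run the image through the network and the network gives the loss".split()
model = train_bigrams(corpus)
result = generate(model, "we", 6)
print(" ".join(result))
```

Because this version always picks the single most likely word, it quickly falls into repeating loops. Systems like this often pick randomly among the likely candidates instead, to get more variety.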


So instead, here are a few sentences manually selected from the out- 
put after 250 iterations. They are included here exactly as generated, 
including punctuation. 


• The responses of the samples in all the red circles share two 
numbers, like the bottom of the last step, when their numbers 
would influence the input with respect to its category. 


• The gradient depends on the loss are little pixels on the wall. 


• We know how to measure the error as a sequence of layers 
blended in the transformation being selected until we see 
some negative values. 


• But before that’s finding an action that accepts the probability 
that the cars have been correctly dependent on the GPU 
that was looking at those bills. 


• Let’s look at the code for different dogs in this syllogism. 


It’s surprising how close these come to making sense! 


Whole sentences are fun, but we found that the most entertaining bits 
came out soon after the start of training, when we were getting only 
fragments. Here are some manually selected excerpts after just 10 
epochs, again presented verbatim with no editing. 


• Set of of apply, we + the information. 


• Because to # function with only 4 is the because which training. 


• Suppose us only parametric. 


• This by this know we on value autoencoder. 


• The usually quirk (alpha train had we than that to use them 
way up). 


These are mostly incoherent, but hidden in these synthetic phrases there 
is a truth: one of the primary goals of this book is indeed to “+ the 
information.” 


Datasets 


Finding good and plentiful data 
is the first step in training a new learner. 
Here we look at some popular datasets, and 
popular repositories that link to much more data. 


Chapter 29: Datasets 


29.1 Public Datasets 


Training a learner can take a lot of time, effort, and compute power. 
Training deep learners with big datasets can take even more. It’s generally 
best, whenever possible, to build on an existing system that someone else 
has created and made publicly available. 


But sometimes that system isn’t quite what we want. We might be able 
to use transfer learning to tune it to our needs by training it some more, 
or training some new layers, with data that’s closer to what we want. 
When that’s not appropriate, we can train a new system from scratch 
with a dataset of our choice. 


There are many datasets available online. Some have restrictions on 
their use. Many collections of faces, for example, can only be used for 
research or educational purposes, and even then we must ask the authors for 
permission first. Other datasets contain copyrighted or privately-held 
data, and again may only be used with permission. It’s important to 
check the rules attached to any dataset before investing too much time 
or energy into it. 


It’s also important to manually inspect any dataset we find online 
to make sure that the data is clean enough for our purposes. It’s not 
unusual to find blank fields, one-off labels, spelling errors, missing 
images, dropped frames, and other problems. 
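

A little code can help direct our attention during that inspection. The sketch below is only a starting point under stated assumptions: the data is a small made-up CSV table, the column names are invented, and a label is flagged as suspicious if it appears only once. It uses nothing beyond Python's standard library.

```python
import csv
import io
from collections import Counter

def inspect_rows(rows, label_field, rare_threshold=1):
    """Report rows with blank fields, and labels that appear suspiciously rarely."""
    blank_rows = [i for i, row in enumerate(rows)
                  if any(v is None or v.strip() == "" for v in row.values())]
    label_counts = Counter(row[label_field].strip()
                           for row in rows if row[label_field])
    rare_labels = sorted(label for label, count in label_counts.items()
                         if count <= rare_threshold)
    return blank_rows, rare_labels

# A made-up table with one blank field and one misspelled, one-off label.
sample_csv = """width,height,label
1.0,2.0,cat
1.5,,cat
0.9,2.1,dog
1.1,1.9,dog
1.2,2.2,dgo
"""
rows = list(csv.DictReader(io.StringIO(sample_csv)))
blank_rows, rare_labels = inspect_rows(rows, "label")
print(blank_rows, rare_labels)
```

A report like this doesn't replace looking at the data ourselves, but it tells us where to look first.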


The net is nothing if not ephemeral, with new datasets popping up all 
the time and older ones disappearing without warning. 


Here are some good starting points that are all available as of early 
2018. 


29.2 MNIST and Fashion-MNIST 


In this book we’ve used the MNIST database for lots of examples. 
MNIST is great because it’s simple and clean, but those same qualities 
limit how well it represents real data. 


A bridge between MNIST and bigger, more complicated datasets is the 
Fashion-MNIST dataset. 


This is a collection of 70,000 images of 10 different categories of fash- 
ion items, such as shirts, shoes, and bags. As the name suggests, this 
database is meant to be a different version of MNIST, but with the 
same structure. The images are still grayscale squares in 10 categories, 
pre-split into 60,000 training images and 10,000 test images. 


If a learner does well with MNIST, then a natural next step is to crank 
up the difficulty a little and try it on Fashion-MNIST. If it stumbles, 
the dataset is still simple enough that we may be able to examine the 
results visually and work out what’s going wrong. 


MNIST 
http://yann.lecun.com/exdb/mnist/ 


Fashion MNIST 
https://github.com/zalandoresearch/fashion-mnist 


29.3 Built-in Library Datasets 


Many machine learning libraries provide datasets directly. 


29.3.1 scikit-learn 


The scikit-learn library we discussed in Chapter 15 provides access 
to synthetic and real datasets. More information is available at 
http://scikit-learn.org/stable/datasets/index.html 


20 newsgroups 
Text of 18,000 newsgroup posts on 20 topics. 


Boston housing prices 
Numerical data for 506 different houses in Boston in the 
1970s, each containing 13 attributes. 


Diabetes 
Text of 442 samples of diabetes patients, each with 10 
attributes. 


Digits 
1,797 8x8 grayscale images of handwritten digits, with labels. 


Forest covertypes 
Text samples with 54 features describing patches of forest. 


Iris 
150 text samples, each with 4 attributes for 3 different types of 
Iris. 


Labeled Faces in the Wild 
13,000 50x37 labeled color images of people. 


Linnerud 
20 text samples describing observations from people performing 
3 types of exercise. 


Olivetti faces 
10 64x64 grayscale images each of 40 people. 


RCV1 
Text of 800,000 manually categorized newswire stories. 


Wisconsin breast cancer 
Text of 569 samples, each with 30 attributes and classification 
(malignant or benign), measured from breast cancer scans. 
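

As a quick illustration, here is how a few of the smaller datasets in this list can be loaded and checked. This sketch assumes scikit-learn is installed. The three loaders shown ship their data with the library, so no download is needed. The dictionary name `summary` is ours.

```python
from sklearn.datasets import load_breast_cancer, load_digits, load_iris

summary = {}
for loader in (load_iris, load_digits, load_breast_cancer):
    bunch = loader()                           # data, target, and a description
    num_classes = len(set(bunch.target))
    summary[loader.__name__] = (bunch.data.shape, num_classes)
    print(loader.__name__, bunch.data.shape, num_classes)
```

Each loader returns the samples as rows of `data` and the labels in `target`. The digits loader flattens each 8x8 image into a row of 64 values, and also offers the unflattened version in `images`.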


29.3.2 Keras 


The Keras library (see Chapter 23) offers easy loading for its own set 
of databases. More information is available at https://keras.io/datasets/ 


Boston Housing Prices 
Numerical data for 506 different houses in Boston in the 
1970s, each containing 13 attributes (same as the scikit-learn 
version). 


CIFAR10 
50,000 32x32 color images in 10 categories. 


CIFAR100 
50,000 32x32 color images in 100 categories. 


Fashion-MNIST 
60,000 28x28 grayscale images of clothing and accessories in 
10 categories. 


IMDB Movie Reviews 
Text of 25,000 movie reviews, labeled by sentiment (either 
positive or negative). 


MNIST digits 
60,000 28x28 grayscale hand-written digits labeled by digit. 


Reuters Newswire Topics Classification 
Text of 11,000 articles from Reuters, labeled by 46 topics. 


29.4 Curated Dataset Collections 


The following websites offer curated lists of datasets. There are hun- 
dreds of datasets available, spanning a wide range of topics, and 
offering their data in formats such as text, pictures, 3D models, 
and video. Some of these datasets are best for classification problems, 
while others are intended for regression. There’s a lot of overlap in these 
collections, but each has unique entries as well. 


Awesome Deep Learning 
About 120 datasets. 
https://github.com/ChristosChristofidis/awesome-deep-learning 


Deep Learning Net Collection 
About 52 datasets. 
http://deeplearning.net/datasets/ 


Deep Mind Open Source Datasets 
7 datasets. 
https://deepmind.com/research/open-source/open-source-datasets 


Kaggle 
Over 1000 datasets. 
https://www.kaggle.com/datasets 


Open Data for Deep Learning 
About 100 datasets presented in 15 categories. 
https://deeplearning4j.org/opendata 


Statlib Datasets Archive 
About 100 datasets. 
http://lib.stat.cmu.edu/datasets/ 


Time Series Data Library 
20 time series datasets. 
https://datamarket.com/data/list/?q=provider:tsdl 


UCI Machine Learning Repository 
About 400 datasets. 
https://archive.ics.uci.edu/ml/index.php 


Wikipedia Datasets 
About 400 datasets. 
https://en.wikipedia.org/wiki/List_of_datasets_for_machine_learning_research 


29.5 Some Newer Datasets 


Here are some datasets that are fun or interesting, and as of early 2018 
may not yet be in the curated collections in the previous section. 


AVA Dataset 
80 visual actions in 57,600 movie clips. 
https://research.google.com/ava/ 


Celebrity In Places 
38,000 images labeled by person and location. 
http://www.robots.ox.ac.uk/~vgg/data/celebrity_in_places/ 


COCO 
Common Objects in Context: 200,000 labeled images, 80 cate- 
gories. 
http://cocodataset.org/#home 


Cornell Movie-Dialogs Corpus 
220,000 text exchanges from movie scripts. 
http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html 


Google Speech Commands Dataset 
65,000 1-second utterances of 30 different words. 
https://research.googleblog.com/2017/08/launching-speech-commands-dataset.html 


Kinetics 
300,000 video clips labeled for 400 human action classes. 
https://deepmind.com/research/open-source/open-source-datasets/kinetics/ 


Large-scale CelebFaces Attributes (CelebA) Dataset 
200,000 images of about 10,000 celebrities, each with landmark 
locations and attributes (such as “eyeglasses,” “wavy hair,” and 
“oval face”). 
http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html 


LSUN (Large-Scale Scene Understanding) 
Almost 10 million images of locations in 10 categories. 
http://lsun.cs.princeton.edu/2017/ 


Matterport3D 
90 homes with 3D meshes, image annotations, depth images, 
and more. 
https://github.com/niessner/Matterport 


Million Song Dataset 
Audio features and metadata for 1 million pop songs. 
https://labrosa.ee.columbia.edu/millionsong/ 


MSRA-CFW 
Data Set of Celebrity Faces on the Web: URLs for 203,000 faces 
with labels, copyright protected. 
https://www.microsoft.com/en-us/research/project/msra-cfw-data-set-of-celebrity-faces-on-the-web/ 


The NSynth Dataset 
Audio clips of 306,000 annotated musical notes for 1,006 instru- 
ments. 
https://magenta.tensorflow.org/datasets/nsynth 


Open Images Dataset 
9 million images with labels and bounding boxes. 
https://github.com/openimages/dataset 


Quick, Draw! The Data 
50 million object doodles. 
https://quickdraw.withgoogle.com/data 


Quora Question Pairs 
400,000 lines of potential duplicate questions. 
https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs 


Street View Image, Pose, and 3D Cities Dataset 
25 million images with matching image pairs, camera pose, and 
3D models. 
https://github.com/amir32002/3D_Street_View 


YouTube-BoundingBoxes Dataset 
380,000 15-20 second videos with labeled object bounding 
boxes. 
https://research.google.com/youtube-bb 


Glossary 


A summary of many of the key 
terms and phrases used in this book. 


Chapter 30: Glossary 


About The Glossary 


Many terms used in machine learning come from other fields, particu- 
larly statistics, probability, and other mathematical disciplines. These 
terms often have precise definitions that rely on additional mathematical 
ideas. In this glossary, as in the book, we prefer the simplest 
interpretations that capture the main idea. 


Most entries in this glossary are followed by the general topic they 
refer to. This is to help disambiguate words that have multiple mean- 
ings when applied to different ideas. Each entry is also followed by 
the chapter number in brackets, identifying where it first appears, 
e.g., <21> refers to Chapter 21. 


The descriptions given here are short reminders of the words and 
phrases they describe, rather than complete descriptions. For more 
details, consult the text. 


0-9 


1 by 1 filter (CNNs) <21> 
A 2D convolution filter with width and height 1. 


1-dimensional space (statistics) <2> 
A conceptual environment where data with a single value can 
be represented. A line is a 1-dimensional space, with the single 
dimension of distance from a fixed reference point. 


1D convolution (CNNs) <21> 
Convolution where the filter is moved in only one dimension. 


2-dimensional space (statistics) <2> 
A conceptual environment where data with two values can be 
represented. A sheet of paper is a 2-dimensional space, with 
the 2 dimensions of width and height. 


2D binary classification (classification) <7> 
The task of assigning labels to 2D samples that each belong to 
one of two classes. 


3-dimensional space (statistics) <2> 
A conceptual environment where data with three values can be 
represented. A cardboard box is a 3-dimensional space, with 
the 3 dimensions of width, height, and depth. 


Greek Letters 


α (reinforcement learning) <21> 
See alpha. 


β (optimizers) <21> 
See beta. 


γ (optimizers) <19> 
See gamma. 


δ, Δ (backpropagation) <18> 
See delta. 


ε (reinforcement learning) <25> 
See epsilon. 


η (backpropagation) <21> 
See eta. 


λ (backpropagation) <18> 
See lambda. 


σ (activation functions) <17> 
See sigma. 


A 


accuracy (probability) <3> 
The percentage of samples that are correctly labeled. 


action (reinforcement learning) <25> 
Instructions from an agent to the environment. 


activation function (neurons) <10> 
The last step in an artificial neuron. The result of summing 
all the weighted inputs is used as the input to this function. 
Its output is the output of the neuron. Activation functions 
introduce a non-linear step into the neuron’s calculations, 
preventing a sequence of neurons from collapsing into just 
one equivalent neuron. 


actor (reinforcement learning) <25> 
Another name for agent. 


Adaboost (ensembles) <14> 
A learning algorithm based on boosting. 


Adadelta (optimizers) <19> 
An optimization technique that improves on Adagrad by using 
a decaying sum of past gradients, so that old gradients make 
weaker contributions to the changes applied to weights. 


Adagrad (optimizers) <19> 
An optimization technique that adapts the size of the gradient 
for each weight by dividing it by a running sum of past gradi- 
ents for that weight. 


Adam (optimizers) <19> 
An optimization algorithm that extends Adagrad and 
RMSprop by maintaining two lists of gradient information. 
This can allow it to better adjust the learning rate for each 
weight. 


adaptive code (information theory) <6> 
Given the likely distribution of symbols in a message, a way 
to represent those symbols so the message is as compact as 
possible. 


Adaptive Gradient Learning (optimizers) <19> 
See Adagrad. 


Adaptive Moment Estimation (optimizers) <19> 
See Adam. 


adversarial perturbation (CNNs) <21> 
An image that usually looks like noise. When added to an 
input image that is correctly classified by a classifier, the 
results are often imperceptible to the naked eye, yet the classi- 
fier will assign the wrong class to this modified image. 


adversary (CNNs) <21> 
An input designed to cause a neural network, usually a con- 
vnet, to produce an incorrect output. Adversaries often appear 
to the naked eye to be identical to images that result in the 
correct category. 


affirming the consequent (reasoning) <11> 
A syllogistic fallacy. In schematic form, it incorrectly asserts 
that because 1) All A are B, and 2) C is B, therefore 3) C is also 
A. 


agent (reinforcement learning) <1> 
An agent takes actions that cause a change to the environment. 


alpha (reinforcement learning) <16> 
A value used to control the blending of old and new values 
when updating a Q-table. 
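
One common way to express this blending is a weighted average, sketched below. The function name and the exact form are assumptions for illustration, not necessarily the update rule used elsewhere in the book.

```python
def blend(old_value, new_value, alpha):
    """Mix an old Q-table entry with a newly computed value.

    alpha = 0 keeps the old value; alpha = 1 replaces it entirely.
    """
    return (1.0 - alpha) * old_value + alpha * new_value
```

With an old value of 2.0, a new value of 10.0, and alpha set to 0.25, the entry moves a quarter of the way toward the new value.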


ancestor (decision trees) <16> 
A node that lies on the path from a given node back to the root. 


Anscombe’s quartet (statistics) <16> 
A set of 4 distinctly different 2D data sets with the same 
means, variances, correlations, and best-fit straight lines. 


anchor (CNNs) <21> 
The element in a filter that is placed over the element of inter- 
est in the input tensor. 


appeal to coincidence (reasoning) <11> 
A fallacy of logical induction that comes from ignoring the 
most obvious conclusion. 


array (scikit-learn) <15> 
In NumPy, an array is a block-shaped data structure of any 
number of dimensions. It is a synonym for tensor. 


artificial intelligence <1> 
The general name for the field that aims to find programmable 
processes that result in the kind of behaviors that people call 
“intelligent.” Machine learning and deep learning are subfields 
in this field. 


artificial neuron (neural networks) <1> 
Small computational units that accept a collection of numbers 
as input. Each number is multiplied by a corresponding num- 
ber called its weight. These results are added together, with 
another number called the bias. The total is then run through 
an activation function to produce a number as a result. 


asynchronous (feedforward networks) <16> 
Two or more events that happen on their own schedules, without 
depending on each other’s timing. 


augmenting (scikit-learn) <15> 
Adding information to some object. When speaking of data 
sets, we can create additional samples, often by making varia- 
tions on existing samples. When speaking of fitting curves and 
surfaces to data, we can create additional features for each 
sample, often by combining the existing features. 


autoencoder (autoencoders) <24> 
A semi-supervised machine learning architecture that learns 
how to represent its input using fewer variables than the input 
itself. 


automatic Bayes (Bayes’ Theorem) <4> 
Using a rule or algorithm to create the prior for an application 
of Bayes’ Theorem. 


B 


backend (Keras) <23> 
Any of several deep learning libraries that implement a network 
created with the Keras library. 


backprop (backpropagation) <18> 
See backpropagation. 


backpropagation (backpropagation) <18> 
An algorithm used in deep learning to efficiently determine 
the gradient of the network’s error function with respect to 
each weight in the network. It is typically followed by updating, 
which actually modifies the values of the weights. 


backpropagation through time (RNNs) <22> 
A modified version of backpropagation that can be used to 
train RNNs. 


bagging (ensembles) <14> 
A technique for building ensembles using bootstrap aggrega- 
tion. We create multiple bootstraps from a starting population, 
and train a learner on each, creating an ensemble. 


balanced (decision trees) <13> 
A tree or subtree with a symmetrical shape. 


batch (backpropagation) <18> 
This usually refers to an entire dataset that is used to train a 
network. See batch gradient descent. 


batch gradient descent (optimizers) <19> 
The entire training data set, or batch, is used to train a net- 
work. When that’s done, the network’s weights are updated. 


batch normalization (deep learning) <20> 
See batchnorm. 


batchnorm (overfitting) <9> 
A regularization method meant to delay the onset of overfit- 
ting in deep learning. The outputs from a layer are collected 
over a batch (or mini-batch) of training data. They are then 
collectively normalized before being sent on to the next layer. 
Some libraries encapsulate the batchnorm algorithm in a 
helper layer of its own. 


Bayes’ Rule (Bayes’ Theorem) <4> 
Another name for Bayes’ Theorem. 


Bayes’ Theorem (Bayes’ Theorem) <4> 
A relationship that lets us find the conditional probabil- 
ity P(A|B), called the posterior. We multiply the likelihood 
P(B|A) by the prior P(A), and divide by the evidence P(B). 
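
The relationship can be checked with a few lines of Python. The numbers below describe a made-up test for a rare condition; they are assumptions for illustration only.

```python
def posterior(likelihood, prior, evidence):
    """Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence


# Hypothetical numbers, invented for this example.
p_a = 0.01               # prior P(A): 1% of people have the condition
p_b_given_a = 0.9        # likelihood P(B|A): positive test, condition present
p_b_given_not_a = 0.05   # P(B|not A): positive test, condition absent

# Evidence P(B): the overall chance of a positive test.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

p_a_given_b = posterior(p_b_given_a, p_a, p_b)
```

Even with a fairly reliable test, the posterior here comes out to roughly 15 percent, because the prior is so small.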


Bayesian (Bayes’ Theorem) <4> 
A problem-solving method based on Bayes’ Theorem, or a person 
who uses such a method. 


behavioral psychology (reasoning) <11> 
Another name for behaviorism. 


behaviorism (reasoning) <11> 
A field of psychology that emphasizes understanding creatures 
based solely on their observable actions or behaviors. 


bell curve (statistics) <2> 
An informal name for the type of smooth curve followed by a 
Gaussian distribution. 


Bernoulli distribution (statistics) <2> 
A probability distribution for a random variable that can take 
on only two values, such as 0 and 1, or yes and no. 


beta (β) (optimizers) <21> 
Either of two variables (called β1 and β2) used to control the 
Adam algorithm. 


bias (deep learning) <4> 
A value added explicitly to the sum computed by an artificial 
neuron. 


bias (surface fitting) <5> 
A property of an algorithm or a prior that causes fit curves or 
surfaces to be similar to one another, despite differences in 
the data they’re trying to match. 


bias trick (neurons) <10> 
A way to write the equations of an artificial neuron so that the 
bias term can be considered just another input with an associ- 
ated weight. This greatly simplifies implementations. 
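
A small sketch of the idea, with hypothetical function names and values: appending a constant input of 1 lets the bias ride along as the last weight, so one loop handles everything.

```python
def neuron_sum(inputs, weights, bias):
    """Weighted sum with the bias handled as a separate term."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias


def neuron_sum_bias_trick(inputs, weights_with_bias):
    """The same sum, with a constant input of 1 appended so that the
    bias is simply the final weight."""
    extended = list(inputs) + [1.0]
    return sum(x * w for x, w in zip(extended, weights_with_bias))


# Made-up values for illustration.
inputs = [2.0, -1.0]
weights = [0.5, 4.0]
bias = 3.0
```

Calling `neuron_sum(inputs, weights, bias)` and `neuron_sum_bias_trick(inputs, weights + [bias])` gives the same result.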


biased sampling (reasoning) <11> 
An induction fallacy that comes from seeing what we want to 
see, despite the evidence. 


bidirectional LSTM (RNNs) <22> 
A bidirectional RNN built from LSTM units. 


bidirectional RNN (RNNs) <22> 
A layer of two RNN units where one receives inputs from start 
to finish, and the other receives them in the opposite order. 


binary (decision trees) <13> 
A type of tree where each node has at most two children. 


binary classification (classification) <7> 
The task of assigning a piece of data to one of two classes. 


binding (neurons) <10> 
The process by which a neurotransmitter attaches onto a 
receptor site on a neuron. 


bit (information theory) <6> 
A unit of information with two possible states. They are usu- 
ally represented as 0 and 1. 


black box (deep learning) <20> 
A term sometimes used to describe a learner with low 
explainability. 


blessing of non-uniformity (classification) <7> 
See blessing of structure. 


blessing of structure (classification) <7> 
A counterpoint to the curse of dimensionality, which observes 
that because our data usually has significant structure, we 
often care only about regions of the sample space where sam- 
ples are relatively dense. 


block (deep learning) <20> 
A data structure formed as a 3D box. It can also be thought of 
as a 3D tensor. 


BLSTM (RNNs) <22> 
See bidirectional LSTM. 


bold driver (optimizers) <19> 
When we use a decay schedule to adjust the learning rate of a 
neural network, this algorithm can make the decay parameter 
both larger and smaller. 


boosting (ensembles) <14> 
An algorithm for ensemble methods that combines multiple 
weak learners into a single strong learner. 


bootstrap (statistics) <2> 
A set of values selected with replacement from a larger set, 
used in the practice of bootstrapping. 


bootstrap aggregating (ensembles) <14> 
See bagging. 


bootstrapping (statistics) <2> 
A method for assigning a confidence interval to a statistical 
measure of a population. This is done by creating multiple 
bootstraps from the population, and measuring their statistics. 
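
Here is a minimal sketch of bootstrapping a confidence interval for the mean. It assumes the simple "percentile" approach (sort the bootstrap means and read off the tails); the function name, defaults, and data are illustrative.

```python
import random
import statistics


def bootstrap_mean_interval(population, n_bootstraps=1000, confidence=0.95, seed=0):
    """Estimate a confidence interval for the mean of `population`."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_bootstraps):
        # One bootstrap: as many values as the population, drawn with replacement.
        bootstrap = rng.choices(population, k=len(population))
        means.append(statistics.mean(bootstrap))
    means.sort()
    tail = (1.0 - confidence) / 2.0
    low = means[int(tail * n_bootstraps)]
    high = means[int((1.0 - tail) * n_bootstraps) - 1]
    return low, high


# Made-up measurements for illustration.
data = [2, 4, 4, 5, 7, 9, 10, 12]
low, high = bootstrap_mean_interval(data)
```

The pair `(low, high)` is the range that holds the middle 95 percent of the bootstrap means.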


bottleneck (autoencoders) <24> 
A layer with fewer parameters than its neighbors, forcing the 
network to represent the data in a compressed form. 


bottom-up (reasoning) <11> 
Another name for inductive reasoning. 


boundary (classification) <7> 
A curve or surface that divides the space into two regions, cor- 
responding to two different classes. 


boundary method (classification) <7> 
A technique for classification where curves or surfaces are 
placed in the space of the samples, intended to partition the 
space so that each region holds only one class of samples. 


BPTT (RNNs) <22> 
See backpropagation through time. 


branch (decision trees) <13> 
Either one of the paths coming out of a node on a decision 
tree, or the entire sub-tree starting with the first node after 
taking that path. 


BRNN (RNNs) <22> 
See bidirectional RNN. 


bushy (decision trees) <13> 
A decision tree that is wide, usually because many nodes have 
multiple children. 


C 


Caffe (autoencoders) <24> 
An open-source library for deep learning. 


callback (Keras) <23> 
A routine that we provide to the library to be invoked at the 
end of each epoch of training. 


candler (classification) <7> 
A person who determines whether or not a chicken egg is fer- 
tilized, traditionally by looking through it at a candle. 


candling (classification) <7> 
See candler. 


capacity <1> 
A measure of the complexity of the representation of data that 
can be built by a given machine-learning model. Often this is 
expressed by the number of learnable parameters. 


categorical data (data prep) <12> 
Almost any kind of data except numerical. 


categorical distribution (statistics) <2> 
An alternate name for the multinoulli distribution. 


categorical syllogism (reasoning) <11> 
A template for a logical argument. In standard form, it is built 
from three statements that make assertions about a subject, a 
middle term, and a predicate. The first statement, called the 
major premise, expresses a relationship between the mid- 
dle term and the predicate. The second statement, called the 
minor premise, expresses a relationship between the subject 
and the middle term. The third statement, called the conclu- 
sion, eliminates the middle term to establish a relationship 
between the subject and the predicate. If the logic is correct, 
the conclusion is valid, otherwise it is invalid. If the conclu- 
sion is valid and the first two statements are actually true in 
the world, the result is sound, otherwise it is unsound. 


categorical variable decision tree (decision trees) <13> 
A decision tree used for categorization. 


categorization <1> 
An alternate name for classification. 


categorizer <1> 
An alternate name for a classifier. 


category (probability) <3> 
In classification problems, an alternate name for the class. 


cell memory (RNNs) <22> 
The memory hosted inside of an RNN unit. 


Central Processing Unit <1> 
The general-purpose computing and control unit at the heart 
of most computers. 


channel (Keras) <23> 
One layer of an image. A grayscale image has a single channel, 
while a color image stored with red, green, and blue values at 
each pixel has three channels. Also used to describe tensors 
that represent images, or data derived from images. 


checkpoint (Keras) <23> 
A file containing the weights of a model, saved during training. 


child (feedforward networks) <16> 
In a tree-like graph, a node that is immediately below another 
node. 


children (decision trees) <13> 
The nodes that are after, or dependent on, a given node. 
Terminal nodes, or leaves, have no children. 


clamp (statistics) <2> 
A mathematical operation involving an input value, an upper 
limit, and a lower limit. If the value is between the limits, it is 
unchanged. If the value is greater than the upper limit, it’s set 
to that limit. If the value is less than the lower limit, it’s set to 
that limit. 
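
The three cases above fit in a one-line Python function (the name `clamp` and the argument order are our own choices):

```python
def clamp(value, lower, upper):
    """Return `value` limited to the range from `lower` to `upper`."""
    return max(lower, min(value, upper))
```

For limits of 0 and 10, an input of 5 is unchanged, an input of 42 becomes 10, and an input of -3 becomes 0.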


class (classification) <7> 
A name for a sample, or a group of samples that share the 
same label. Often training data includes a manually-assigned 
class label for each sample, with the goal of developing an 
algorithm that will correctly predict the class of new samples. 


classification <1> 
The process of using a trained learner to determine the likeli- 
hood that an input belongs to different classes. 


classifier <1> 
A machine learning system for performing classification. 


clustering <1> 
The act of automatically organizing unlabeled data into groups 
that share some useful kind of similarity. 


clustering algorithm <1> 
A type of unsupervised learning algorithm that takes in unla- 
beled data and attempts to use measures of similarity to 
organize the data into a given number of classes. 


CNN (CNNs) <21> 
See convolutional neural network. 


CNN-LSTM network (RNNs) <22> 
A network that starts with one or more convolutional layers 
that are followed by one or more LSTM layers. 


CNTK (deep learning) <23> 
An open-source library for deep learning. 


compiling (Keras) <23> 
The act of transforming a specification of a model into code 
that can be executed on a particular backend library. 


complexity (overfitting) <9> 
Sometimes used as an alternate name for capacity. 


compression (autoencoders) <24> 
The process of converting an input into a representation that 
requires fewer bits. This may be done in a lossless or lossy 
manner. 


compression ratio (information theory) <6> 
The size of a compressed message relative to its uncompressed 
size. 


conclusion (reasoning) <11> 
The third statement in a standard-form categorical syllogism. 


conditional probability (probability) <3> 
The chance that one statement is true, given that another 
statement is known to be true. If statement B is true, then the 
conditional probability that statement A is true is written as 
P(A|B). 


conditional syllogism (reasoning) <11> 
A 3-line syllogism that first states a general relationship, then 
provides information on a specific instance of that relation- 
ship, and then presents a result based on logically combining 
the first two statements. 


confidence interval (statistics) <2> 
A range of values that we believe contains a statistical measure 
of interest. The interval is often accompanied by a numerical 
confidence that the value is indeed contained in that range. 


confusion matrix (probability) <3> 
A 2 by 2 grid that summarizes predictions of values into true 
positives, true negatives, false positives, and false negatives. 
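
As an illustration, the four cells can be counted directly from paired lists of true and predicted labels. The labels below are invented, and `True` stands for the positive class.

```python
def confusion_matrix(true_labels, predicted_labels):
    """Count the four outcomes for binary labels (True means positive)."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for truth, prediction in zip(true_labels, predicted_labels):
        if prediction and truth:
            counts["TP"] += 1      # predicted positive, really positive
        elif prediction and not truth:
            counts["FP"] += 1      # predicted positive, really negative
        elif not prediction and truth:
            counts["FN"] += 1      # predicted negative, really positive
        else:
            counts["TN"] += 1      # predicted negative, really negative
    return counts


# Made-up labels for illustration.
truth = [True, True, False, False, True, False]
predicted = [True, False, False, True, True, False]

matrix = confusion_matrix(truth, predicted)
accuracy = (matrix["TP"] + matrix["TN"]) / len(truth)
```

Accuracy then falls out of the grid as the correct predictions divided by the total number of samples.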


connectome (neurons) <10> 
The collection of connections between neurons in a brain. 


constant (statistics) <2> 
A value that does not change during the course of a 
computation. 


constant-length code (information theory) <6> 
A representation of a set of symbols where each symbol is rep- 
resented by the same amount of information. 


content blending (autoencoders) <24> 
Blending two objects by interpolating their overt 
representations. 


content problem (data sets) (Keras) <23> 
A numerical error or problem in a dataset. 


context (RNNs) <22> 
For sequential data, the samples before and after a given 
sample. 


context (RNNs) <22> 
Information about the environment surrounding a piece of 
data. 


continuous (statistics) <2> 
A property of a curve or surface of any number of dimensions. 
It tells us that there are no abrupt changes (such as jumps or 
cliffs) in the values of the curve or surface. Informally, a 
continuous curve is one that can be drawn without lifting one’s 
pencil from the paper. 


continuous probability distribution (statistics) <2> 
A probability distribution defined for every real value in an 
interval. 


continuous variable decision tree (classifiers) <13> 
A decision tree used for regression. 


convergence (neural networks) <25> 
The process where a system’s changes gradually become 
smaller as it finds and settles into a stable situation. We often 
hope the system is converging onto its best configuration. 


convnet (deep learning) <20> 
Another name for a convolutional neural network. 


convolution (CNNs) <21> 
A mathematical operation in which a tensor of data called 
the filter, or kernel, is sequentially centered over elements in 
an input tensor. At each element, the values in the input are 
multiplied by their corresponding values in the filter, and the 
results are summed. This number is the output of the process 
for that element in the input. 
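
The operation can be sketched for a one-dimensional list. Two assumptions are worth flagging: elements that fall outside the input are treated as 0, and the filter is applied without flipping it, as the description above implies.

```python
def convolve_1d(signal, kernel):
    """Center an odd-length kernel over each element; multiply and sum.

    Positions that fall outside the signal are treated as 0 (an
    assumption about how to handle the edges). The kernel is not flipped.
    """
    half = len(kernel) // 2
    output = []
    for center in range(len(signal)):
        total = 0
        for offset, k in enumerate(kernel):
            index = center + offset - half
            if 0 <= index < len(signal):
                total += signal[index] * k
        output.append(total)
    return output


result = convolve_1d([1, 2, 3, 4], [1, 0, -1])
```

Each output value is the left neighbor minus the right neighbor, so for this steadily rising input the interior outputs are all -2.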


convolutional autoencoder (autoencoders) <24> 
An autoencoder that is dominated by convolutional layers. 


convolutional layer (deep learning) <20> 
An encapsulation of the convolution operation in a deep learning 
layer. 


convolutional neural network (deep learning) <20> 
A deep learning network where convolutional layers play a 
dominant role. 


correlation (statistics) <2> 
A statistical measure that tells us how well one variable’s 
changes can be predicted from another variable’s changes. 


correlation coefficient (statistics) <2> 
A statistical measure that provides a numerical value to the 
amount of correlation between two variables. 


corruption <1> 
An external influence (usually noise) that distorts the value of 
a sample. 


cost <1> 
A measure of the error in the output of a learner. Also called 
the loss. 


covariance (statistics) <2> 
A statistical measure that describes the degree to which val- 
ues of two variables are related. Positive covariance says 
that when one variable is near the top or bottom of its range, 
the other variable is respectively near the top or bottom of 
its range. Negative covariance says that this relationship is 
reversed. The larger the magnitude of the covariance, the 
stronger the relationship. 


cpd (statistics) <2> 
See continuous probability distribution. 


CPU <1> 
See Central Processing Unit. 


credit assignment problem (reinforcement learning) <25> 
The task of determining which actions taken by the agent 
should change in response to the ultimate reward. 


cropping layer (deep learning) <20> 
A helper layer that considers the input to be an image of 1 or 
more channels, and trims elements away from the left, right, 
top, and/or bottom edges of the image. The result is a tensor 
with fewer elements than in the input. 


cross-entropy (information theory) <6> 
We are given two probability distributions, one tuned for a 
particular class of messages and one that is not. This is the 
average number of bits required when we represent a message 
from the class using the non-tuned distribution. The additional 
bits this costs, compared to using the tuned distribution, are 
measured by the Kullback-Leibler divergence. 
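
A short sketch, using the standard formula and two made-up distributions, shows the bits needed with each distribution and the gap between them. That gap is the quantity usually called the Kullback-Leibler divergence. The variable names are our own.

```python
import math


def cross_entropy(true_dist, code_dist):
    """Average bits per symbol when symbols follow `true_dist` but the
    code is tuned for `code_dist`."""
    return -sum(p * math.log2(q) for p, q in zip(true_dist, code_dist) if p > 0)


# Hypothetical distributions over three symbols.
tuned = [0.5, 0.25, 0.25]     # what the messages actually follow
untuned = [0.25, 0.25, 0.5]   # what the other code was built for

bits_tuned = cross_entropy(tuned, tuned)       # this is the entropy
bits_untuned = cross_entropy(tuned, untuned)
extra_bits = bits_untuned - bits_tuned
```

Here the tuned code needs 1.5 bits per symbol on average, the untuned code needs 1.75, and the difference is 0.25 bits.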


cross-validation (training) <8> 
Estimating the performance of a model on new data by repeatedly 
re-training on just part of the training data, using the rest as 
test data for that particular model. 
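
One common way to organize the repeated splits is k-fold cross-validation, sketched below with indices only. The function name is hypothetical, and real projects would normally shuffle the data first and rely on a library routine.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `remainder` folds get one extra sample each.
        stop = start + fold_size + (1 if fold < remainder else 0)
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test
        start = stop


splits = list(k_fold_splits(6, 3))
```

Each sample lands in the test set exactly once, and a fresh model is trained on the remaining samples for every fold.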


curse of dimensionality (classification) <7> 
The phenomenon where increasing the number of features in 
a set of samples causes those samples to fill the sample space 
more sparsely. This can make accurate analysis challeng- 
ing because of the increase in empty space where there is no 
information to work with. 


cusp (curves and surfaces) <5> 
A sharp point in a curve or surface. The derivative is undefined 
at a cusp. 


D 


DAG (feedforward networks) <16> 
See directed acyclic graph. 


dart throwing (probability) <3> 
A popular metaphor for drawing random variables from a 
given distribution. 


data amplification <1> 
The process of creating more data that is in some way like a 
given set of data. 


data augmentation (Keras) <23> 
The process of generating new data by creating simple modifi- 
cations of existing data, often by creating minor distortions of 
that data. 


data cleaning (data prep) <12> 
The process of looking at a set of data and putting it into the 
best form for machine learning. This can require judiciously 
adding and removing information based on our knowledge of 
the data and its source. 


data contamination (training) <8> 
An alternate name for information leakage. 


data generation <1> 
An alternate name for data amplification. 


data leakage (training) <8> 
An alternate name for information leakage. 


data point (machine learning) <3> 
An alternate name for sample. 


data preparation (data prep) <12> 
Another name for data cleaning. 


data refinement (scikit-learn) <15> 
Algorithms for eliminating redundant features from data by 
either removing or combining those features. 


data scrubbing (data prep) <12> 
Another name for data cleaning. 


DCGAN (GANs) <26> 
See deep convolutional generative adversarial network. 


dead (backpropagation) <18> 
A characterization of a neuron which has entered a state 
where small changes to its weights have no effect on its output. 


decay parameter (optimizers) <19> 
A value that controls the speed of descent of an exponentially 
decaying curve. 


decay schedule (optimizers) <19> 
When a neural network controls the learning rate over time by 
using some form of exponentially decaying curve, this is a pol- 
icy that sets the value of the decay parameter for each epoch. 


decision boundary (classification) <7> 
A curve or surface placed into a space of samples that separates 
regions of samples of different classes. 


decision stump (ensembles) <14> 
A decision tree made up of only a root and its immediate 
children. 


decision tree (classifiers) <13> 
A machine learning algorithm that applies a sequence of tests 
to input data, ultimately describing that data with a summary 
description based on the results of those tests. The technique 
is useful for classification and regression. 


decode (autoencoders) <24> 
Convert a piece of information into another representa- 
tion. Typically, this step is performed on data that has been 
encoded, so the data is transformed into its original form of 
representation. 


deconvolution (CNNs) <21> 
A deprecated term for a convolution layer that also performs 
upsampling. Transposed convolution is now preferred. 


deduction (reasoning) <11> 
A method of logical reasoning that starts with a hypothesis, 
then collects data to prove or disprove that hypothesis. Often 
the hypothesis is narrowed in scope as the data is collected, to 
exclude those situations in which it was found to be incorrect. 
If enough supporting data cannot be found, the hypothesis 
is abandoned. Deduction is often described as a top-down 
approach, since it starts with a hypothesis and works its way 
down to observations. 


deductive logic (reasoning) <11> 
An alternate name for deduction. 


deductive reasoning (reasoning) <11> 
An alternate name for deduction. 


deep convolutional generative adversarial network (GANs) <26> 
A GAN where the networks are dominated by convolution 
layers. 


deep dreaming (creative applications) <28> 
An algorithm that combines the filter losses at multiple layers 
to drive gradient ascent on an input image. The results are 
often psychedelic. 


deep learning <1> 
A type of architecture used for machine learning. Typically, it 
involves multiple “hidden” layers between the input and the 
output. The more layers there are, the deeper the network. It 
also refers to a learner that is based on analyzing the input 
data hierarchically, or at ever-larger scales. Often both of 
these qualities are present in a single network. 


deep network (deep learning) <20> 
A network of multiple layers of artificial neurons, and perhaps 
other helper layers. 


deep RNN (RNNs) <22> 
A neural network built from more than one RNN layer. 


deep reinforcement learning (reinforcement learning) <25> 
A technique that replaces the Q-table with a deep neural 
network. 


delta (backpropagation) <18> 
A letter in the Greek alphabet, δ in lower case, or Δ as a capital. 
In the backpropagation algorithm, delta is often used to 
represent how the change in a weight is amplified to become a 
change in the network’s error. 


denoising (autoencoders) <24> 
The process of removing noise from a signal. Often this means 
using an autoencoder to remove noise from an image. 


dense layer (Keras) <23> 
Another name for a fully-connected layer. 


denying the antecedent (reasoning) <11> 
A syllogistic fallacy. In schematic form, it incorrectly asserts 
that because 1) All A are B, and 2) C is not A, therefore 3) C is 
not B. 


dependent (statistics) <2> 
A variable whose value is influenced by the value of one or 
more other variables. 


deploy <1> 
Release a learner for use by others, typically so it can evaluate 
new data that the system has never seen. 


deployment data (training) <8> 
Data provided to a learner after it has been released, or 
deployed. 


depth (decision trees) <13> 
The number of branches between the root and the farthest 
leaf. 


depth (deep learning) <1> 
The number of layers in a deep learning network. 


derivative (curves and surfaces) <5> 
A measure of the slope of a curve at a given point. This tells us 
how much the value of the curve is changing at that point. 


descendant (feedforward networks) <16> 
In a tree-shaped graph, a node whose path to the root goes 
through a given node. 


deterministic (statistics) <2> 
Predictable or repeatable. 


DFR <25> 
See discounted future reward. 


dilated convolution (CNNs) <21> 
A mechanism for upsampling during convolution that involves 
inserting new elements with value 0 into the input tensor. 


dimension (Keras) <23> 
An axis of change. We often say that a piece of data has a num- 
ber of dimensions, or dimensionality, given by the number of 
values it contains. 


dimension reduction <1> 
Elimination of one or more features from a data set. This usu- 
ally results in faster training, and may improve results over a 
system trained with the original data set. 


dimensionality reduction (data prep) <12> 
The process of combining features in the samples of a dataset. 


directed acyclic graph (feedforward networks) <16> 
A graph where each edge has an arrow, or direction, and there 
are no loops. 


directed divergence (information theory) <6> 
See Kullback-Leibler divergence. 


discount factor (reinforcement learning) <25> 
A value that is used to reduce the impact of anticipated 
rewards after a given action to the end of the episode. The far- 
ther into the future a reward is, the more it is discounted (or 
made smaller). 


discounted future reward (reinforcement learning) <25> 
A technique for assigning a score to an action that takes into 
account an estimate of the rewards from future actions that 
may result. 
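
A minimal sketch of the calculation, assuming the common convention that a reward k steps in the future is multiplied by the discount factor raised to the power k. The name `gamma` for the discount factor is an assumption.

```python
def discounted_future_reward(rewards, gamma):
    """Sum the rewards, scaling the reward k steps ahead by gamma**k."""
    return sum(reward * gamma ** step for step, reward in enumerate(rewards))


dfr = discounted_future_reward([1.0, 1.0, 1.0], gamma=0.5)
```

With a discount factor of 0.5, three rewards of 1.0 contribute 1.0, 0.5, and 0.25. A discount factor of 1.0 would count them all equally.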


discrete probability distribution (statistics) <2> 
A probability distribution that is only defined for specific input 
values. 


discriminator (GANs) <26> 
The network that determines whether a given sample does or 
does not belong to a particular data set. 


disjunctive syllogism (reasoning) <11> 
A syllogism that asserts that only one of two possibilities is 
true, then provides us with information to select one of the 
choices, and concludes with the selected choice. 


dit (information theory) <6> 
The amount of time required to send a single dot in Morse 
code, or a reference to the dot itself. 


downsampling (CNNs) <21> 
A helper layer that usually considers the input to be an image 
of 1 or more channels, and scales down the width and height 
of that image, creating a new tensor that has fewer elements 
than the input. Often, the elements are examined in small 
rectangular blocks, and processed to produce an output. 
Typical processing methods include extracting the largest 
value in the block, or taking the average of all the values in the 
block. 
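
Both processing methods can be sketched on a small single-channel image stored as a list of rows. The function names are hypothetical, and the sketch assumes the width and height are multiples of the block size.

```python
def pool_2d(image, block=2, combine=max):
    """Downsample a 2D list by applying `combine` to each block-by-block tile.

    Assumes the height and width are multiples of `block`.
    """
    pooled = []
    for top in range(0, len(image), block):
        row = []
        for left in range(0, len(image[0]), block):
            tile = [image[r][c]
                    for r in range(top, top + block)
                    for c in range(left, left + block)]
            row.append(combine(tile))
        pooled.append(row)
    return pooled


def average(values):
    return sum(values) / len(values)


# A made-up 4 by 4 image.
image = [[1, 2, 5, 6],
         [3, 4, 7, 8],
         [9, 8, 1, 0],
         [7, 6, 3, 2]]

max_pooled = pool_2d(image)                    # largest value in each tile
avg_pooled = pool_2d(image, combine=average)   # average of each tile
```

Either way, the 4 by 4 input becomes a 2 by 2 output with a quarter as many elements.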


draw (statistics) <2> 
The process of producing a value for a random variable. 


dropout (overfitting) <9> 
A regularization method meant to delay the onset of overfit- 
ting in deep learning. The dropout algorithm is applied to a 
specific layer. Prior to each epoch, some percentage of the 
neurons on a given layer are temporarily disconnected from 
the network (the percentage is a parameter for dropout, and 
may be different for each layer it is applied to). The train- 
ing process then proceeds as usual, though the disconnected 
neurons don’t compute new values and their weights aren’t 
updated after backpropagation. When the epoch is done, the 
neurons are reconnected, a new random set is disconnected, 
and the process repeats. 


dropout layer (deep learning) <20> 
An encapsulation of the dropout algorithm into a helper layer, 
typically placed immediately after the layer to which dropout 
should be applied. 
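
The disconnecting step at the heart of dropout can be sketched by zeroing a random subset of a layer's outputs. This is a simplification: the function name and values are made up, and many libraries also rescale the surviving outputs, which is omitted here.

```python
import random


def apply_dropout(layer_outputs, drop_fraction, rng):
    """Zero out a random subset of outputs, mimicking disconnected neurons."""
    n_drop = round(drop_fraction * len(layer_outputs))
    dropped = set(rng.sample(range(len(layer_outputs)), n_drop))
    return [0.0 if i in dropped else value
            for i, value in enumerate(layer_outputs)]


rng = random.Random(0)
outputs = [0.2, 1.5, -0.7, 0.9, 2.1, -1.3, 0.4, 0.8]
masked = apply_dropout(outputs, drop_fraction=0.25, rng=rng)
```

Calling the function again picks a new random subset, which corresponds to reconnecting the neurons and disconnecting a different set.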


dummy variable (data prep) <12> 
Another name for a variable encoded with one-hot encoding. 


E 


early stopping (overfitting) <9> 
A technique for preventing overfitting. When the test loss 
stops dropping, or rises for more than a small number of 
epochs, training is brought to a halt, regardless of how many 
epochs of training might have been initially requested. 
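
The stopping rule can be sketched as a loop over the loss recorded after each epoch. The name `patience`, for the number of non-improving epochs we tolerate, and the loss values are assumptions for illustration.

```python
def epochs_before_stopping(losses, patience=2):
    """Return how many epochs run before training halts.

    Training stops once the loss has failed to improve on its best
    value for `patience` epochs in a row.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(losses)


# Made-up losses: they improve for three epochs, then rise.
recorded = [0.9, 0.7, 0.6, 0.62, 0.65, 0.5, 0.4]
stopped_at = epochs_before_stopping(recorded)
```

Training halts after epoch 5 here, so the later improvement is never seen. That is the tradeoff behind choosing how many epochs to wait.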


eigenvector (data prep) <12> 
When part of a PCA algorithm, one of the directions in which 
data is projected. 


elementwise transformation (data prep) <12> 
A transformation that independently processes every value in 
a data set. 


ELU (activation functions) <17> 
See exponential ReLU. 


encode (autoencoders) <24> 
Convert a piece of information into another representation. 
Often this new representation is more compact than the input. 


ensemble (classifiers) <13> 
A group of learners. Often they are of slightly different design, 
and/or trained on slightly different data. Ensembles are often 
used simultaneously, and the outputs of their components 
combined to create a single output by voting. 


ensemble forecasting (ensembles) <14> 
Running the same model multiple times with slightly different 
starting conditions. 


entropy (information theory) <6> 
An estimate of the minimum number of bits required to send a 
message. It represents how much work we need to do to communicate 
information. 
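
For a set of symbols with known probabilities, this estimate is usually computed with Shannon's formula, sketched here (the function name is our own):

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits: the average of -log2(p), weighted by p."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)
```

A fair coin needs 1 bit per flip, four equally likely symbols need 2 bits each, and a symbol that is certain needs none.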


environment (reinforcement learning) <25> 
Everything but the agent, which the agent influences by taking 
actions. 


episode (reinforcement learning) <25> 
One full training cycle from the agent’s first action to the envi- 
ronment’s final reward. 


epoch (training) <8> 
During training, a single pass through the training data. The 
training data is usually provided to the system in a different 
order in each epoch. 


epsilon (ε) (reinforcement learning) <25>
A parameter for Q-learning. See epsilon-greedy. 


epsilon-greedy (reinforcement learning) <25> 
An algorithm used by Q-learning to select an action. The 
algorithm guides the choice between a known action which 
returns good rewards, and a new action whose results are 
not yet known. The choice is governed by a parameter usually 
written as ε (epsilon).
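
A minimal sketch of this selection step follows. We assume here that epsilon is the probability of exploring (some presentations use the opposite convention), and the names `epsilon_greedy` and `q_values` are hypothetical.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of estimated rewards."""
    if rng.random() < epsilon:
        # Explore: try any action, chosen uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: choose the action with the best known estimate.
    return max(range(len(q_values)), key=lambda a: q_values[a])

action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.1)
```

With epsilon set to 0 the choice is always the best-known action, and with epsilon set to 1 every choice is a random exploration.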


epsilon-soft (reinforcement learning) <25> 
See epsilon-greedy. 


error <1> 
This usually refers to the discrepancy between the system’s 
output and the desired results. 

eta (η) (backpropagation) <18>
A frequent choice to represent the learning rate. See learning 
rate. 

evaluation (reasoning) <11> 
The process of determining the quality of performance of a 
system. 

event (information theory) <6> 
An object with information that we would like to represent. 


evidence (Bayes’ Theorem) <4> 
The probability of event B occurring, or P(B), used when cal- 
culating P(A|B). 

expected value (statistics) <2> 
A statistical term referring to the value we expect from a ran- 
dom variable, taking into account the distribution of values it 
is drawn from. It can be thought of as the average value of that 
variable after drawing it a large number of times. 
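
For a variable with a finite list of possible values, we can compute this directly by weighting each value by its probability. The sketch below uses a fair six-sided die as a made-up example. The function name `expected_value` is our own.

```python
def expected_value(values, probabilities):
    """Weight each possible value by how likely it is, then add them up."""
    return sum(v * p for v, p in zip(values, probabilities))

die_faces = [1, 2, 3, 4, 5, 6]
die_expected = expected_value(die_faces, [1 / 6] * 6)
```

The result for the die is 3.5, which is not a value the die can ever show. It is the long-run average of many rolls.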

expense (information theory) <6> 
The cost of sending a message with a given representation. 


experience replay (reinforcement learning) <25> 
The process of running a system through a recorded series of 
actions, while processing the rewards from those actions. 

expert system <1>
An algorithm that simulates the decisions of a human expert.
Usually this is done by applying explicit rules created by
human experts that are designed to capture their deci-
sion-making process.


expert’s label (classification) <7> 
A class label assigned to a sample by a human expert. 

explainability (deep learning) <20>
A description of how well we are able to understand why a 
learner made a particular prediction. 


exploding gradient (RNNs) <22> 
A problem that can occur during the training of an RNN, when 
the gradient at a particular weight increases rapidly. This can 
make it difficult to adjust and improve the weight. 


explore or exploit dilemma (reinforcement learning) <25> 
When an agent selects an action, it can choose to try out 
actions that are untried or only infrequently selected, or 
instead rely on well-known actions that are likely to return 
known rewards. 


exponential decay (optimizers) <19> 
A curve with a shape given by the mathematical operation of 
raising a fixed number to the value of -1 times the input times
a fixed value called the decay parameter. 
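
Written as code, the description above becomes a one-line function. In this sketch the fixed number defaults to e, which is a common choice but an assumption on our part. The names are ours as well.

```python
import math

def exponential_decay(x, decay, base=math.e):
    """Raise a fixed base to the power of -1 * x * decay."""
    return base ** (-decay * x)

# The curve starts at 1 and shrinks towards 0 as the input grows.
samples = [exponential_decay(x, decay=0.5) for x in range(5)]
```

A larger decay parameter makes the curve fall towards 0 more quickly.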


exponential ReLU (activation functions) <17> 
A smooth and continuous activation function that is like a 
smoothed-out version of a shifted ReLU. 


ExtraTrees (ensembles) <14> 
A variation on random forests where a node’s splitting point is 
chosen at random. 


F


f1 score (probability) <3>
A single value that combines precision and recall.
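
One common way to combine them is the harmonic mean, sketched below from raw counts. The function and argument names are our own, and this sketch assumes the counts are large enough that neither precision nor recall involves dividing by zero.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall, computed from counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

score = f1_score(true_positives=8, false_positives=2, false_negatives=2)
```

The harmonic mean stays low unless both precision and recall are high, so a system cannot earn a good score by excelling at only one of them.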

fair (Bayes’ Theorem) <4> 
The nature of a coin that is equally balanced for both heads 
and tails. 

fallacy (reasoning) <11>
An error resulting from incorrect logical reasoning. 


fallacy of exclusion (reasoning) <11> 
Another name for the fallacy of overwhelming exception. 


false negative (probability) <3> 
In a classification problem, a sample that should have been 
classified as true, but was assigned false. 


false negative rate (probability) <3> 
The percentage of true samples that were incorrectly labeled. 


false positive (probability) <3> 
In a classification problem, a sample that should have been 
classified as false, but was assigned true. 


false positive rate (probability) <3> 
The percentage of false samples that were incorrectly labeled. 


fan-in (neural networks) <21> 
The number of inputs arriving at an artificial neuron. 


faulty generalization (reasoning) <11> 
A fallacy of logical induction where a conclusion is drawn 
without enough samples to back it up. 


FC (deep learning) <20> 
See fully connected. 


feature (machine learning) <1>
One of the individual pieces of data that collectively make up a sample.


feature <1>
A particular pattern or structure in the data of an input tensor
that causes a large response from a particular filter. The act of 
training a convolution layer involves manipulating the weights 
in filters to discover useful features. 


feature (image processing) <1> 
An image structure that is usually recognizable by a human 
observer, such as an edge, shape or repeating structure. 


feature bagging (ensembles) <14> 
A variation on bagging. When we split a node, rather than con- 
sider the quality of splitting on all features, we examine only a 
random subset of them. 


feature detector (CNNs) <21> 
Another name for a filter. 


feature engineering <1> 
The process of using information about a collection of data 
to create features that improve the performance of a learner. 
This term usually refers to the manual production of such fea- 
tures by one or more people. The automatic process is known 
as feature learning. 


feature filtering (data prep) <12> 
Another name for feature selection. 


feature learning <1> 
The process of automatically processing a collection of data to 
create features that improve the performance of a learner. The 
same process carried out by hand is called feature engineering. 


feature map (CNNs) <21> 
In a convolutional layer, the collected results of applying a fil- 
ter to an input tensor. 


feature selection (data prep) <12> 
The removal of redundant or unnecessary features in the sam- 
ples of a data set. 


featurewise processing (data prep) <12> 
A transformation that collects together all the values of a given 
feature from all samples in a data set, analyzes them, and 
transforms them as a group. 

feed-backward (feedforward networks) <16>
The flow of data in a network when it travels from the outputs 
towards the inputs. 


feed-forward (feedforward networks) <16> 
The flow of data in a network when it travels from the inputs 
towards the outputs. 


feedback (mathematics) <1> 
Applying a value produced inside, or at the output, of a system, 
perhaps in a modified form, to an earlier part of the same sys- 
tem. It can also mean using the system’s output to modify the 
system itself. 


feedback (reinforcement learning) <25> 
Another name for the reward signal. 


filter <20> 
A tensor of numbers that is moved over the input during 
convolution. 


final reward (reinforcement learning) <25> 
Another name for the ultimate reward. 


finite (statistics) <2> 
A mathematical property of a collection of elements. Casually, 
it means that if we count the elements, we will eventually 
reach the end, with a specific number representing how many 
elements there are. 

fit (statistics) <2>
A verb for the mathematical operation of analyzing some 
collection of data in order to determine the parameters of a 
representation of that data. Often we search for the best val- 
ues of those parameters given some criteria. For example, we 
can fit a straight line to some 2D data, meaning that we find 
the line that comes closest to all the points. Or we can fit a 
neural network to a data set, meaning that we find the values 
of the weights that give us the best results for questions about 
that data. 


fixed-length code (information theory) <6> 
See constant-length code. 


flat (statistics) <2> 
A curve or surface that has a constant value. Sometimes this 
only refers to the non-zero portion of a curve or surface. For 
example, a surface might be zero outside of some region, 
while inside that region it has a value of 1. 


flatten layer (deep learning) <20> 
A helper layer that is a special case of a reshaping layer. The 
elements of the input tensor are reshaped into a 1-dimen- 
sional list. 


fold (training) <8> 
During cross-validation, one of multiple equal-sized chunks of 
training data. Usually a model is trained with all but one fold 
of data, with that remaining fold used to evaluate its perfor- 
mance on new data. 


footprint (CNNs) <21> 
Another name for local receptive field. 


forgetting (RNNs) <22> 
The process of allowing some or all elements of an RNN unit’s 
internal memory to move partly or wholly towards 0.

fractional striding (CNNs) <21>
Another name for transposed convolution. 


free-running (reinforcement learning) <25> 
An agent or environment that works or changes even without 
a signal from the other party. 


freeze (neural networks) <26> 
Prevent the weights in a layer from being updated during 
training. 


frequentism (Bayes’ Theorem) <4> 
A school of thought in probability that says that to find the 
most likely outcome of an experiment one should run that 
experiment many times and choose the most frequently-oc- 
curring result. 


frequentist (Bayes’ Theorem) <4> 
Someone who prefers the probability framework of 
frequentism. 


full modal collapse (GANs) <26> 
When the generator in a GAN produces the same output every 
time. 


full observability (reinforcement learning) <25> 
A policy that provides an agent with access to all parameters 
that describe the state of the environment. 


fully-connected layer (deep learning) <20> 
A layer of artificial neurons in which every neuron receives 
input from every neuron on the preceding layer. Also called a 
dense layer. 


fully-connected network (deep learning) <20> 
A deep learning network dominated by fully-connected layers. 

function (statistics) <2>
An object that accepts one or more values as inputs, and pro-
duces one or more values as outputs. In casual use, we often
also speak of the outputs as the function. For example, a func- 
tion might take in two numbers and produce a new number 
as output. The collection of output values is then sometimes 
referred to as the function itself. 


function (programming) <15> 
A piece of code that accepts inputs and produces outputs. The 
term is also sometimes used informally even when a routine 
takes no inputs, returns no outputs, or both. 

Functional API (Keras) <23>
The collection of objects, methods, and functions in Keras
designed for models with structures that may not form a sin-
gle stack of layers.


G 


game theory (GANs) <26> 
A field of study that investigates both real and theoretical 
games, or competitions, and strategies for doing well at them. 
GAN <26> 
See generative adversarial network. 


gate (RNNs) <22>
A processing unit that takes a list of values and a list of
weights, and multiplies each value by its corresponding
weight.


gated recurrent unit (RNNs) <22> 
An RNN unit that is a simplified version of an LSTM. 

Gaussian distribution (statistics) <2>
A probability distribution that mostly has a value of nearly
0, except for a smooth hill-shaped region. It is usually char-
acterized by its mean (the center of the hill) and its standard
deviation (the width of the hill).
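
The hill shape can be drawn from the usual formula for the Gaussian density, sketched here with the standard library only. The function name `gaussian` and its defaults are our own choices.

```python
import math

def gaussian(x, mean=0.0, std=1.0):
    """Height of the Gaussian curve at x."""
    z = (x - mean) / std  # distance from the center, in standard deviations
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

peak = gaussian(0.0)        # the top of the hill, at the mean
far_away = gaussian(10.0)   # very nearly 0, far from the mean
```

Changing the mean slides the hill left or right, and increasing the standard deviation makes it wider and lower.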


generalization (reasoning) <11> 
A principle of logical reasoning that states if a sample set is 
representative of the population, then the properties of the 
sample set will also hold for the population. 


generalization <1>
How well a learner performs after deployment, when exposed
to new data. 


generalization accuracy (overfitting) <9> 
See generalization error. 


generalization error <1> 
The error in the output of a learner when it processes new 
data after being deployed. Also called generalization loss. 


generalization loss (overfitting) <9> 
See generalization error. 


generative adversarial network <26> 
A strategy for teaching two networks to simultaneously detect 
problems in each other’s work, forcing each network to 
improve its performance. When completed, each network can 
be used independently. One network can identify whether a 
piece of data belongs to the data set the networks were trained 
on, the other can generate new data that is similar to, but dif- 
ferent from, the examples in that data set. 


generator (GANs) <26>
The network that produces new data that is intended to be sta-
tistically similar to the examples in a given data set.

generator <1>
An algorithm that produces new data. Often that data is 
intended to be different from, but statistically similar to, a 
given data set. 


Gini Impurity (classifiers) <13> 
A value used in decision trees to characterize the quality of 
a proposed splitting of a node. It provides a measure of how 
likely the proposed split is to misclassify a new sample. We 
typically choose the split with the least chance of error. 
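
For a single node, the usual formula is 1 minus the sum of the squared class proportions. The sketch below computes it for a list of labels. The function name and the example labels are made up for illustration.

```python
from collections import Counter

def gini_impurity(labels):
    """1 minus the sum of squared class proportions in a node."""
    total = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

pure_node = gini_impurity(["cat", "cat", "cat", "cat"])
mixed_node = gini_impurity(["cat", "cat", "dog", "dog"])
```

A node holding a single class scores 0, and an even two-class mix scores 0.5, so lower values indicate a cleaner split.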


global maximum (curves and surfaces) <5> 
The largest value anywhere on a curve or surface. 


global minimum (curves and surfaces) <5> 
The smallest value anywhere on a curve or surface. 


Glorot normal initialization (feedforward networks) <16> 
A method for initializing every weight in a neural net- 
work with a value based on values drawn from a normal (or 
Gaussian) distribution. 


Glorot uniform initialization (feedforward networks) <16> 
A method for initializing every weight in a neural network with 
a value based on values drawn from a uniform distribution. 


GPU <1>
See graphics processing unit. 


gradient (curves and surfaces) <5> 
Given a point on a surface, an arrow that points in the direc- 
tion where the surface’s value is increasing the most. 


gradient ascent (curves and surfaces) <5> 
The act of moving along a surface by following the gradient at 
each point, generally with the intent of finding a local maxi- 
mum value of some function. 

gradient descent (curves and surfaces) <5>
The act of moving along a surface by following the negative 
gradient at each point, generally with the intent of finding a 
local minimum value of some function. 
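
Here is a minimal one-dimensional sketch of the idea. The curve f(x) = (x - 3)^2, the step size, and the number of steps are all arbitrary choices we made for this example.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=100):
    """Repeatedly take a small step in the direction of the negative gradient."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * gradient(x)
    return x

# For f(x) = (x - 3)^2 the gradient is 2 * (x - 3), and the minimum is at x = 3.
minimum_x = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
```

Flipping the sign of the update, so that we add the gradient instead of subtracting it, turns this into gradient ascent.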


graph theory (feedforward networks) <16> 
The mathematical field that studies the properties of graphs. 


graphics processing unit <1> 
A chip originally designed to accelerate the rendering of 
images by computer graphics. The math used to run deep 
learning can be structured like the math that GPUs were 
designed to perform, greatly improving the speed of training 
and predicting. 


greedy (classifiers) <13> 
An algorithm that makes decisions based on the data it has at 
the moment. The hope is that local decisions will end up find- 
ing a local minimum or maximum of a function. 


grid (deep learning) <20> 
A data structure shaped as a rectangle. It can also be thought 
of as a 2D tensor. 


ground truth (probability) <3> 
In problems where we are trying to approximate or derive a 
value, this is the known, correct value we’re trying to match. 


GRU (RNNs) <22> 
See gated recurrent unit. 


H


halting problem (reasoning) <11> 
A thought experiment in which we ask if a computer, given 
a particular program and input, will ever stop. It has been 
proven that no algorithm can be created that can answer this 
question for every program and input. 

hasty generalization (reasoning) <11>
A fallacy of logical induction that comes from drawing a con- 
clusion before collecting a representative set of data. 


He normal initialization (feedforward networks) <16> 
A method for initializing every weight in a neural network 
with a value based on a normal (or Gaussian) distribution. 


He uniform initialization (feedforward networks) <16> 
A method for initializing every weight in a neural network 
with a value based on a uniform distribution. 


Heaviside step (activation functions) <17> 
A unit step function with a threshold of 0.


helper layer (deep learning) <20> 
A layer that does not contain artificial neurons, but imple- 
ments an algorithm that operates on one or more such layers, 
or the data going into or out of them. 


Helvetica scenario (GANs) <26> 
See full modal collapse. 


hidden layer (backpropagation) <18> 
A layer in a deep learning network that is between the input 
and output layers. 


hierarchy (deep learning) <20> 
A series of layers which operate on larger and larger pieces of 
the input, enabling analyses of the data from a fine level to a 
broader scale. 


hierarchy (graphs) <16> 
A structure of elements, often like a tree, where higher nodes 
either exercise control over lower nodes, or aggregate the 
information coming from them. 

high-dimensional space (statistics) <2>
A conceptual environment where data with multiple values 
can be represented. The term is usually applied for more than 
3 dimensions. It is difficult to picture such spaces, but in many 
cases we can reason about them by analogy with more familiar 
1-, 2-, and 3-dimensional spaces. It’s important to verify those 
analogies, as high-dimensional spaces may have unintuitive 
and surprising properties. 


Hume’s fork (reasoning) <11> 
The argument that rationalist ideas born of pure logic are
preferable to empirical observations. 


hyperbolic tangent (activation functions) <17> 
See tanh. 


hyperparameter <1> 
A parameter, or value, that is set before the start of learn- 
ing. It differs from other parameters by remaining fixed, or 
unchanged, during learning. 


hyperparameter tuning (classification) <7> 
The process of trying out different values of one or more 
hyperparameters in a search for the best results. 


hypothesis (Bayes’ Theorem) <4> 
A statement put forth as possibly true, and then subjected to 
investigation. 


hypothesis (deep learning) <1> 
The state of an architecture and its weights. 


I


i.i.d. (statistics) <2>
See independent and identically distributed. 

identity activation function (backpropagation) <18>
The identity function used as an activation function. 


identity function (activation functions) <17> 
The function that returns the input as output. 


illicit major (reasoning) <11> 
A syllogistic fallacy. In schematic form, it incorrectly asserts 
that because 1) All A are B, and 2) No C is A, therefore 3) No C
are B. 


illicit minor (reasoning) <11> 
A syllogistic fallacy. In schematic form, it incorrectly asserts 
that because 1) All A are B, and 2) All A are C, therefore 3) All 
C are B.


image processing (CNNs) <21> 
The field of study devoted to analyzing and understanding 
images. 


impure (decision trees) <13> 
A node whose contents are not sufficiently similar under some 
criterion. 


inceptionism (creative applications) <28> 
See deep dreaming. 


independent (statistics) <2> 
A variable whose value is not influenced by any other variables. 


independent and identically distributed (statistics) <2> 
A set of random variables that are drawn from the same distri- 
bution, but are otherwise not related. 

induction (reasoning) <11>
The method of logical reasoning that starts with observations. 
As the observations arrive, they are examined for patterns. 
When a pattern seems particularly well established, we can 
consider it a hypothesis and apply deductive reasoning. 
Induction is often described as a bottom-up approach, since it 
starts with data and works its way up to hypotheses. 


inertia (optimizers) <19> 
A physical property of objects that roughly says that an object 
in motion will tend to continue that motion unless interfered 
with. 


infinite (statistics) <2> 
A mathematical property of a collection of elements. Casually, 
it means that if we count the elements, we will never reach the 
end, or run out of elements. 


information (information theory) <6> 
A measure of surprise in a signal, based on the elements in the 
signal and a probability distribution of those elements. 


information gain (decision trees) <13> 
A value used to characterize the quality of a proposed splitting 
of a node. It compares the summed entropy of all children of 
the node to the entropy of the node itself. The split that pro- 
vides the greatest reduction in entropy is typically chosen. 
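
One common way to compute this value is sketched below. It assumes each child's entropy is weighted by the fraction of the parent's samples that child receives, which is a widespread convention but an assumption here. The function names are our own.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy, in bits, of the class labels in one node."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of its children."""
    total = len(parent)
    weighted = sum(len(c) / total * label_entropy(c) for c in children)
    return label_entropy(parent) - weighted

parent = ["a", "a", "b", "b"]
clean_split = information_gain(parent, [["a", "a"], ["b", "b"]])
useless_split = information_gain(parent, [["a", "b"], ["a", "b"]])
```

The clean split removes all of the parent's uncertainty, while the useless split removes none, so we would choose the first.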


information leakage (data prep) <12> 
An error in the learning process where the learner is exposed 
to some information about, or contained within, the test or val-
idation data. The result can be a misleading estimate of the
model’s performance. This leakage often occurs in a subtle 
way. 


information theory (information theory) <6> 
The field of study that investigates the properties of messages, 
their representations, and their communication. 

input layer (backpropagation) <18>
The first layer of a deep network, where input values are held. 
Because there are no neurons or other processing elements 
in the input layer, it is often omitted from drawings and not 
included when counting the number of layers in the network. 


instrumental conditioning (reasoning) <11> 
Another name for operant conditioning. 


integer (statistics) <2> 
A whole number, or a number without a fractional part. 


interpolation (autoencoders) <24> 
A synonym for blending. 


interval decay (optimizers) <19> 
A decay schedule where the decay parameter is modified only 
after a certain number of epochs have elapsed. 


invalid (reasoning) <11>
A categorical syllogism that violates logic when deriving its
conclusion. 


inverse transformation (data prep) <12> 
A transformation that reverses the operation of some other 
transformation. Generally, if we apply a transformation to a 
sample, and then apply the inverse transformation, we get 
back our original sample. 


iris dataset (Keras) <23> 
A popular, small dataset describing the physical characteris- 
tics of 3 different types of iris flowers. 

J


joint probability (probability) <3> 
The probability that two statements are simultaneously true. 
If the statements are A and B, the probability that both are 
true is written as P(A,B). 


Jupyter (Keras) <23> 
A browser-based environment for developing programs in 
dozens of programming languages, including Python. 


K 


k-fold cross-validation (training) <8> 
Cross-validation where the training data has been broken up 
into a number of folds given by the value of k. For example, 
5-fold cross-validation uses 5 folds of data. The phrase “k-fold 
cross-validation” refers to this technique without specifying 
the value of k to be used. 
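
The bookkeeping behind the folds can be sketched with plain lists of sample indices. This minimal version does not shuffle the data first, which we would usually want in practice, and the function name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(num_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(num_samples))
    fold_size = num_samples // k
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs any leftover samples.
        stop = num_samples if fold == k - 1 else start + fold_size
        validation = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, validation

splits = list(k_fold_indices(num_samples=10, k=5))
```

Each sample lands in the validation fold exactly once, and we would train a fresh model on the `train` indices of each pair.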


k-means (scikit-learn) <15> 
An unsupervised learning algorithm that groups objects into k 
clusters. 


k-nearest neighbors (classifiers) <13> 
A non-parametric classifier. Given a sample and a value for k, 
find the k samples that are closest to that point, and return the 
label that is most popular among those samples. 
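
The whole classifier fits in a few lines, as the sketch below suggests. We assume straight-line (Euclidean) distance, and the sample points and labels are made up for illustration.

```python
import math
from collections import Counter

def knn_predict(samples, labels, query, k):
    """Return the most popular label among the k samples nearest to query."""
    by_distance = sorted(range(len(samples)),
                         key=lambda i: math.dist(samples[i], query))
    nearest_labels = [labels[i] for i in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

samples = [(0, 0), (0, 1), (5, 5), (6, 5)]
labels = ["red", "red", "blue", "blue"]
prediction = knn_predict(samples, labels, query=(0.2, 0.3), k=3)
```

Notice that there is no training step. All of the work happens when a new sample arrives.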


Keras (Keras) <23> 
An open-source library for deep learning, written in Python. 


kernel (deep learning) <20> 
Another name for a filter. 


kernel (support vector machines) <13> 
A piece of math at the heart of the algorithm. 

kernel trick (classifiers) <13>
A mathematical technique that allows a support vector
machine to find distances between transformed points without
actually transforming them, greatly improving the algorithm's
speed.


KL divergence (information theory) <6> 
See Kullback-Leibler divergence.


KLIC (information theory) <6> 
See Kullback-Leibler divergence.


kNN (classifiers) <13> 
See k-nearest neighbors. 


Kullback-Leibler divergence (information theory) <6> 
Given two probability distributions, a measure of how differ- 
ent they are. When the value is 0, the two distributions are the 
same. As the value increases, the distributions are increas- 
ingly different. 
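
A minimal sketch for two discrete distributions follows. It measures the result in bits, and it assumes that q is never 0 where p is non-zero. Both choices, and the function name, are ours.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence of q from p, in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

same = kl_divergence([0.5, 0.5], [0.5, 0.5])
different = kl_divergence([0.9, 0.1], [0.5, 0.5])
```

Swapping the two arguments generally gives a different value, so this measure is not a distance in the everyday sense.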


L 


L-learning (reinforcement learning) <25> 
A lousy reinforcement learning algorithm. This imaginary 
algorithm is used in the book only as a device to introduce 
Q-learning. 


L-table (reinforcement learning) <25>
A table used by L-learning to save a score for every action in
every situation. 


L-value (reinforcement learning) <25> 
A value saved in an L-table. 


label <1> 
The category assigned to a sample. 

lambda (λ) (backpropagation) <18>
Often refers to the strength of regularization applied during 
backpropagation. See regularization term. 


latent layer (autoencoders) <24> 
The layer where the latent variables are computed. 


latent variable (autoencoders) <24> 
A variable learned by an autoencoder with which to represent 
the input. 


layer (deep learning) <1> 
A grouping of artificial neurons. Typically, neurons in a given 
layer receive their inputs, and send their outputs, only from or 
to neurons on other layers. 


lazy (classifiers) <13> 
An algorithm that defers some types of work until they are 
necessary. 


leaf (decision trees) <13> 
Another name for a terminal node. 


leaky ReLU (activation functions) <17> 
A piecewise linear activation function that has a line with a 
slope of 0.1 to the left of 0, and the identity function to the 
right of 0.
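
As a sketch, the two pieces of the function can be written as a single conditional. The default slope of 0.1 follows the description above, and the function name is our own.

```python
def leaky_relu(x, slope=0.1):
    """Identity for inputs at or above 0, a gently sloped line below 0."""
    return x if x >= 0 else slope * x

outputs = [leaky_relu(x) for x in (-2.0, 0.0, 3.0)]
```

Unlike a plain ReLU, negative inputs are scaled down rather than set to 0.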


learner <1> 
A machine-learning model. 


learning <1> 
The training process where the system analyzes input data to 
draw information from it. In deep learning, the system modi- 
fies its parameters over time to develop a representation of the 
input data. 

learning rate (deep learning) <1>
A scaling factor used to control the amount of change applied to
a weight during an update step. Often represented by the low-
er-case Greek letter η (eta).


learning rate (reinforcement learning) <25> 
A parameter used by Q-learning to control how to combine the 
old and new values of a Q-score during the update process. It 
is often written with the lower-case Greek letter α (alpha).
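
One simple way to picture this combination is as a weighted blend of the two values, sketched below. The function name `blend_q` is hypothetical, and a full Q-learning update involves more than this single step.

```python
def blend_q(old_value, new_value, alpha):
    """Mix the old and new Q-values, with alpha controlling the balance."""
    return (1 - alpha) * old_value + alpha * new_value

cautious = blend_q(2.0, 4.0, alpha=0.1)  # stays close to the old value
eager = blend_q(2.0, 4.0, alpha=0.9)     # moves most of the way to the new value
```

An alpha of 0 ignores the new value entirely, and an alpha of 1 replaces the old value with the new one.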


LeCun normal initialization (feedforward networks) <16> 
A method for initializing every weight in a neural net- 
work with a value based on values drawn from a normal (or 
Gaussian) distribution. 


LeCun uniform initialization (feedforward networks) <16> 
A method for initializing every weight in a neural network with 
a value based on values drawn from a uniform distribution. 


likelihood (Bayes’ Theorem) <4> 
The conditional probability P(B|A) that is used in Bayes’ Rule 
as part of the computation of the result P(A|B). 
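
To see where the likelihood fits, here is a small sketch of Bayes' Rule with made-up numbers. The probabilities below are purely illustrative, and the variable names are our own.

```python
def posterior(prior, likelihood, evidence):
    """Bayes' Rule: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence

p_a = 0.01              # prior, P(A)
p_b_given_a = 0.9       # likelihood, P(B|A)
p_b_given_not_a = 0.05  # chance of seeing B when A is false

# The evidence P(B) accounts for every way that B can happen.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
p_a_given_b = posterior(p_a, p_b_given_a, p_b)
```

Even with a likelihood of 0.9, the result is only about 0.15, because A was so unlikely to begin with.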


limited observability (reinforcement learning) <25> 
When an agent has access to only some of the parameters that 
describe the state of the environment. 


linear correlation (statistics) <2> 
Two variables with perfect positive or negative correlation. 


linear curve (activation functions) <17> 
A straight line. 


linear function (activation functions) <17> 
A 2D function whose output, when graphed, forms a straight 
line. In higher dimensions, a function whose output is found 
by multiplying its inputs by constants and then adding the 
results. 

linear regression <1>
The process of finding the best straight line that fits a given set 
of data. 


linearly separable (ensembles) <14> 
A data set with two classes that can be separated with a linear 
surface in the space of the data. In 2D, the elements would 
be separable with a line. In 3D, they can be separated with a 
plane. 


local context (information theory) <6> 
The parts of a signal soon before (and perhaps soon after) a 
particular part of interest. 


local maximum (curves and surfaces) <5> 
The largest value of a curve or surface in a given region, or 
near a given point. 


local minimum (curves and surfaces) <5> 
The smallest value of a curve or surface in a given region, or 
near a given point. 


local receptive field (CNNs) <21> 
The elements of the input tensor that are combined with a 
given filter. 


logistic curve (activation functions) <17> 
Another name for a sigmoid curve. 


logistic function (activation functions) <17> 
Another name for a sigmoid curve. 


long-term dependency problem (RNNs) <22> 
The observation that no matter how much memory we pro- 
vide to a simple RNN unit, we can always devise an input that 
would require more. 

loop (feedforward networks) <16>
A flow of information through a graph that allows a node to
receive as input information that at any point was influenced
by its output.


loss (deep learning) <1> 
Another name for a network’s error. 


lossless encoding (autoencoders) <24> 
An encoding of data from one format to another in which 
no information is lost. This means the information can be 
decoded in order to exactly retrieve the original version. 


lossy encoding (autoencoders) <24> 
An encoding of data from one format to another in which 
information is lost, usually because we are using a smaller 
representation in order to compress the input. When decoded, 
the data will usually be similar to, but not the same as, the
version it started out as. 


LSTM (RNNs) <22> 
The architecture of a particular type of unit in an RNN. It
contains local memory, which is controlled by multiple inter- 
nal neural networks. The networks develop their weights 
during training, and then those weights are frozen. The inter- 
nal memory, though, can be modified after deployment. This 
allows the unit to remember past samples and use their values 
to influence its output on newer samples. 


M 


machine learning <1> 
The process of using algorithms to analyze a set of data, with 
the intent of deriving useful information. Often the result of 
machine learning is a trained system that can be used to pro- 
cess and describe new data that is similar to, but different 
from, the data used for training. 

magnitude (curves and surfaces) <5>
With respect to a vector like the gradient, the length or size of 
the vector. 


major premise (reasoning) <11> 
The first statement in a standard-form categorical syllogism. 


many to many (RNNs) <22> 
An RNN structure where a sequence of inputs produces a 
sequence of outputs. 


many to one (RNNs) <22> 
An RNN structure where a sequence of inputs produces a sin- 
gle output. 


mapping (data prep) <12> 
A correspondence between the elements in one data set, and 
the elements in another. This can be considered a way to rep- 
resent a transformation. 


margin (classifiers) <13> 
In a support vector machine, the distance from a boundary to 
a sample. 


marginal probability (probability) <3> 
The probability that some statement is true. If the statement 
is A, then the marginal probability is written as P(A). 


matrix (deep learning) <20> 
A data structure formed in a rectangle. It can also be thought 
of as a 2D tensor. 


maxima (optimizers) <19> 
The plural of maximum. 


maximum ascent (curves and surfaces) <5> 
The direction in which the value of a curve or surface is 
increasing the most. 

maximum descent (curves and surfaces) <5>
The direction in which the value of a curve or surface is 
decreasing the most. 


maxout (activation functions) <17> 
A piecewise linear activation function that is based on a collec- 
tion of straight lines. At each value of input, the height of each 
line is found, and the most positive of those values is returned. 
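
A minimal sketch, assuming each line is described by a slope and an intercept, looks like this. The two example lines are an arbitrary choice that happens to reproduce the absolute value function.

```python
def maxout(x, lines):
    """Evaluate every line at x and return the most positive height.

    lines: a list of (slope, intercept) pairs.
    """
    return max(slope * x + intercept for slope, intercept in lines)

v_shape = [(1.0, 0.0), (-1.0, 0.0)]
heights = [maxout(x, v_shape) for x in (-3.0, 0.0, 2.0)]
```

In a network, the slopes and intercepts of these lines would be learned during training rather than fixed by hand.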


mean (statistics) <2> 
The most common kind of “average” value for a list of num- 
bers, given by their sum divided by the number of values in 
the list. 


median (statistics) <2> 
The value in the middle of a sorted list of numbers. 


mellowmax (reinforcement learning) <25> 
A method for adjusting the values in a row of a Q-table after 
one of the Q-values has been updated. 


middle term (reasoning) <11> 
A component of both the subject and predicate in a categorical 
syllogism. 


mini-batch (optimizers) <19> 
A piece of a data set. Often large training, validation, and 
test sets are broken up into many of these smaller pieces. 
Frequently, their size is a power of 2, to better allow calcula- 
tions on a GPU. 


mini-batch gradient descent (optimizers) <19> 
Applying the gradient descent algorithm to mini-batches of 
data. Typically, after each mini-batch, backpropagation and then
an update step are applied to improve the weights.
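
A rough sketch of the process, assuming a toy problem where a single weight w is fit so that w * x matches y. The data, learning rate, and function names are illustrative assumptions, and the gradient is written out by hand rather than found by backpropagation.

```python
def make_mini_batches(samples, batch_size):
    """Break a list of samples into consecutive pieces of at most batch_size."""
    return [samples[i:i + batch_size]
            for i in range(0, len(samples), batch_size)]


def train(samples, batch_size=2, learning_rate=0.05, epochs=200):
    """Fit y = w * x by applying gradient descent to one mini-batch at a time."""
    w = 0.0
    for _ in range(epochs):
        for batch in make_mini_batches(samples, batch_size):
            # Gradient of the mean squared error over this mini-batch only.
            gradient = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            # Update step: move the weight against the gradient.
            w -= learning_rate * gradient
    return w


# Toy data that follows y = 2 * x exactly.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w = train(data)
print(w)  # settles very close to 2
```

The batch size of 2 here is only for readability. As noted under mini-batch, sizes in practice are frequently larger powers of 2.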


minima (optimizers) <19> 
The plural of minimum. 


minimax (GANs) <26>
An algorithm for choosing the best move in a game involving 
more than one player. The technique chooses the move that 
best maximizes the benefit for the player making the move, 
while also minimizing the benefit of the best possible moves 
then made available to the other players. 


minor premise (reasoning) <11> 
The second statement in a standard-form categorical 
syllogism. 


misleading vividness (reasoning) <11> 
A fallacy of logical induction that comes about when our 
senses seem to present an inescapable, but incorrect, 
conclusion. 


MLP (deep learning) <20> 
See multi-layer perceptron. 


MNIST <1> 
MNIST stands for Modified NIST (NIST itself stands for the US
National Institute of Standards and Technology). The MNIST data set
contains 70,000 grayscale images of size 28 by 28: 60,000 for
training and 10,000 for testing. Each image is a handwritten digit
from 0 to 9. Half of the digits came from high school students and
half from employees at the US Census Bureau. It is one of the most
popular training data sets for small experiments in deep learning.


modal collapse (GANs) <26> 
When the generator in a GAN produces the same output, or 
one of a few outputs, every time. 


mode (statistics) <2> 
The value that occurs most frequently in a list of numbers. 


model <1>
This term has multiple meanings. It can refer to the architec- 
ture, or structure, of a learner. It can refer to the parameters 
that the learner has determined as a result of training. It can 
also refer to both the architecture and parameters together. 


momentum (optimizers) <19> 
A physical value associated with a moving body due to its 
mass and velocity. In deep learning, the term is often used to 
refer to incorporating some of the change of an object (such as 
a weight) from a previous step into the current step. 


momentum gradient descent (optimizers) <19> 
An optimization algorithm that updates a weight using the 
change calculated by the current update step, plus a little bit 
of the change that was applied during the previous step. The 
term is often used to represent the physical property of inertia. 
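
The update can be sketched as follows. This is only an illustration under stated assumptions: the function name, the learning rate, and the scaling factor gamma = 0.9 are choices made for this example, and the error curve is the simple bowl w * w.

```python
def momentum_step(weight, gradient, previous_change,
                  learning_rate=0.1, gamma=0.9):
    """Return the updated weight and the change that was applied to it."""
    # The change from the current step, plus a scaled bit of the previous change.
    change = -learning_rate * gradient + gamma * previous_change
    return weight + change, change


# Illustrative use: minimize the error curve w * w, whose gradient is 2 * w.
weight, change = 5.0, 0.0
for _ in range(300):
    weight, change = momentum_step(weight, 2.0 * weight, change)
print(weight)  # ends up very close to 0
```

Setting gamma to 0 removes the contribution of the previous step, leaving ordinary gradient descent.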


momentum scaling factor (optimizers) <19> 
A value that controls how much of the previous change is
added to a weight when performing momentum gradient
descent. It is often represented by the lower-case Greek letter
γ (gamma).


multi-class classifier (classification) <7> 
A classification algorithm for data with more than 2 classes. 


multi-layer perceptron (deep learning) <20> 
A network made up of fully-connected layers. 


multinoulli distribution (statistics) <2> 
A generalization of the Bernoulli distribution. Suppose a random
variable can take on any one of N values, numbered 0 to
N-1. The random variable is represented by a list of N values, 
all of which are set to 0 except for a 1 at the entry correspond- 
ing to the variable’s value. Also see one-hot encoding. 


multiple correlation (statistics) <2>
A correlation between more than 2 variables. 


multiresolution searching (Keras) <23> 
The process of searching for values using a coarse sampling of 
the space, and then searching again with increasingly dense 
sampling in smaller and smaller regions. 
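
One way this might look for a single value is sketched below. The scoring function, the number of samples per pass, and the number of passes are all assumptions chosen for illustration.

```python
def multiresolution_search(score, low, high, samples=11, levels=5):
    """Look for the value in [low, high] with the best score, refining as we go."""
    best = low
    for _ in range(levels):
        # Sample the current region coarsely and keep the best candidate.
        step = (high - low) / (samples - 1)
        candidates = [low + i * step for i in range(samples)]
        best = max(candidates, key=score)
        # Search again, more densely, in a smaller region around that candidate.
        low, high = best - step, best + step
    return best


# A made-up score that peaks at 0.3.
best = multiresolution_search(lambda x: -(x - 0.3) ** 2, 0.0, 1.0)
print(best)  # close to 0.3
```

This sketch assumes the score has a single peak. With several peaks, a coarse first pass can settle on the wrong region.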


multivariate transformation (data prep) <12> 
A transformation that collects statistics about two or more 
features in a data set (perhaps all of them), and uses those sta- 
tistics collectively to transform all the measured features. 


N 


Nash equilibrium (GANs) <26> 
The state of a game where no player can gain advantage if 
the other players follow their best strategies. In a GAN, this 
means that both networks have become as good as they can 
get with the data available to them. 


Naive Bayes (classifiers) <13> 
A fast and simple classifier. The algorithm starts with an 
assumed prior, usually determined even without looking at 
the data itself (and thus is an uninformed, or naive, prior). 
Often this prior assumes the data has a Gaussian distribution. 
Then the algorithm classifies data as though that prior was 
true. 


negative covariance (statistics) <2> 
When two variables have the property that as the value of one 
variable increases, the value of the other decreases. 


negative gradient (curves and surfaces) <5> 
Given a point on a curve or surface, the direction opposite the 
gradient at that point. 


negative punishment (reasoning) <11>
In operant conditioning, removing a pleasant stimulus from a
situation to decrease the frequency of an undesired behavior.


negative reinforcement (reasoning) <11> 
In operant conditioning, removing an unpleasant stimulus
from a situation to increase the frequency of a desired
behavior.


neighborhood (curves and surfaces) <5>
A region near a given point. The meaning of “near” depends
on the context.


neighbor (classifiers) <13>
In k-nearest neighbors, a sample that is close to a given
sample.


Nesterov accelerated gradient (optimizers) <19> 
See Nesterov momentum. 


Nesterov momentum (optimizers) <19> 
An optimization algorithm that starts like momentum gradi- 
ent descent. Once the new update and the old scaled update 
have been added to the weight, the algorithm finds the gra- 
dient of the error curve at that computed weight value. Some 
of the gradient at that point is added into the update of the 
weight as well to produce the final value. 


neural network (neurons) <10> 
A learning system built from artificial neurons. 


neuron (neurons) <10> 
In neuroscience, any one of a class of cells that are associated 
with the central nervous system. Neurons are often considered
the essential building block of the brain, and thus of conscious
thought. 


neuron (deep learning) <20>
In deep learning, a neuron is a small bit of computation that 
was originally inspired by biological neurons, but is vastly 
simpler. See perceptron. 


No Free Lunch Theorem (optimizers) <19> 
The proven assertion that no optimizer can out-perform every 
other optimizer in every possible situation. 


node (classifiers) <13> 
A branching point in a decision tree, or one of the leaves at the 
bottom. 


noise <1> 
Generally speaking, noise is unpredictable data. Often the 
term refers to any kind of information added to data that dis- 
torts that data. Usually noise is random (or stochastic), or else 
is largely unpredictable in a practical sense. The statistics of 
the noise may be known. 


noise layer (deep learning) <20> 
A helper layer that adds random noise to its input data. 


noise reduction <1> 
The process of reducing or removing noise from a data set (or 
individual sample). 


noisy ReLU (activation functions) <17> 
A ReLU function with random noise added to its output. The 
result may be discontinuous, un-smooth, or both. 


nominal data (data prep) <12> 
Categorical data without a natural ordering. 


non-linearity (mathematics) <17> 
Any operation that is not linear. 


non-linearity (neural networks) <17> 
Any activation function that is not a linear function. 


non-linearly correlated (statistics) <2>
A correlation between two variables that is not linear. 


non-parametric classifier (classifiers) <13> 
A classification algorithm that is not parametric. That is, it 
does not have a predetermined representation of data whose 
controlling values are tuned in response to training. See para- 
metric classifier. 


normal deviates (statistics) <2> 
A set of values that follow a normal, or Gaussian, distribution. 


normal distribution (statistics) <2> 
See Gaussian distribution. 


normal initialization (neural networks) <16> 
Setting every weight in a neural network to the value of a ran- 
dom variable drawn from a normal (or Gaussian) distribution. 


normalization (data prep) <12> 
The process of transforming a collection of numbers so that 
they span a given range, usually [0,1] or [-1,1]. 
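
A minimal sketch of one common way to do this, often called min-max scaling. The function name and its defaults are assumptions made for this example.

```python
def normalize(values, new_min=0.0, new_max=1.0):
    """Rescale a list of numbers so they span the range [new_min, new_max]."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        raise ValueError("cannot normalize values that are all the same")
    scale = (new_max - new_min) / (highest - lowest)
    return [new_min + (v - lowest) * scale for v in values]


print(normalize([2, 4, 6]))             # spans [0, 1]
print(normalize([2, 4, 6], -1.0, 1.0))  # spans [-1, 1]
```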


normalization layer (deep learning) <20> 
A helper layer that processes its incoming data, typically either
to normalize it to the range [0,1] or [-1,1], or to normalize and
then standardize it. 


normalized (statistics) <2> 
A set of numbers which, considered as a vector, has a length of 
1. In practice, it means that if every value in the list is squared 
and the results summed, that sum will be 1. In some situa- 
tions, we say that a list of values is normalized if they add up 
to 1, without squaring them first. 


normally distributed (statistics) <2> 
A set of values that follow a normal, or Gaussian, distribution. 


O


observation (Bayes’ Theorem) <4> 
The outcome of an event. 


offline algorithm (optimizers) <19> 
An algorithm that requires all information it might need to use 
to be present before it starts. 


one to many (RNNs) <22>
An RNN structure where a single input produces a sequence of
outputs.


one to one (RNNs) <22>
An RNN structure where a single input produces a single
output.


one-against-all (classification) <7> 
See one-versus-rest. 


one-hot encoding (data prep) <12> 
A representation of a piece of data that can only take on one of 
a finite number of values. We create a list of 0’s that is as long 
as there are possible values for the variable, and we assign 
each possible value a position in that list. Then we place a sin- 
gle 1 in the position corresponding to the value of the variable. 
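
The steps above can be sketched in a few lines. The function name and the color values are hypothetical examples.

```python
def one_hot(value, possible_values):
    """Return a list of 0's with a single 1 at the position assigned to value."""
    encoding = [0] * len(possible_values)
    encoding[possible_values.index(value)] = 1
    return encoding


colors = ["red", "green", "blue"]
print(one_hot("green", colors))  # the 1 sits in the second position
```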


one-versus-all (classification) <7> 
See one-versus-rest. 


one-versus-one (classification) <7> 
A multi-class classification algorithm where each pair of 
classes is given its own binary classifier. A new sample is then
run through every classifier, and assigned the class value that
is most frequently reported. 


one-versus-rest (classification) <7>
A multi-class classification algorithm where each class 
receives its own binary classifier. When presented with a new 
sample, each classifier returns a confidence that the sample is 
part of that class. The sample is assigned the class correspond- 
ing to the classifier with the highest confidence. 
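
A small sketch of the final selection step. The binary classifiers here are stand-ins: each is a made-up function that returns higher confidence the closer a number is to a hypothetical class center, rather than a trained model.

```python
def one_versus_rest(sample, classifiers):
    """Return the class whose binary classifier reports the highest confidence.

    classifiers maps each class name to a function that returns a confidence.
    """
    return max(classifiers, key=lambda name: classifiers[name](sample))


# Stand-in classifiers: confidence falls off with distance from a class center.
centers = {"small": 0.0, "medium": 5.0, "large": 10.0}
classifiers = {
    name: (lambda sample, c=center: 1.0 / (1.0 + abs(sample - c)))
    for name, center in centers.items()
}

print(one_versus_rest(4.2, classifiers))  # "medium" has the highest confidence
```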


online algorithm (optimizers) <19> 
An algorithm that can process new information that arrives as 
the algorithm is running. 


operant conditioning (reasoning) <11> 
In behaviorism, an approach to providing or withholding 
rewards or punishments to cause an individual to perform a 
particular action more or less frequently. 


optimal (reasoning) <11> 
A name for a system that is at its optimum, or the absolute 
peak of its possible performance. An optimal system cannot 
be improved upon. 


optimization (reasoning) <11> 
The process of improving a system, or moving it towards its 
optimum. 


optimum (reasoning) <11> 
The state of a system that is exhibiting its best possible 
performance. 


ordinal (data prep) <12> 
Categorical data with a built-in ordering. 


outlier (overfitting) <9> 
A data point that seems markedly different from most of the 
rest of the data. Outliers are often treated with suspicion of 
being errors during data collection or recording. 


output layer (backpropagation) <18> 
The final layer in a deep network. The outputs of this layer are 
the outputs for the network as a whole. 


overfitting (overfitting) <9> 
The phenomenon where a learner’s performance on the train- 
ing data is increasing and the loss is decreasing, while the 
error on the validation data is increasing. Generally speaking, 
a learner in this condition is memorizing idiosyncratic details 
in the training set that do not generalize, causing reduced per- 
formance on new data. 


overlapping windows (RNNs) <22> 
Pieces of input data that share elements in common. The pieces
are usually of the same fixed size, and equally spaced in the
input.
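
A minimal sketch, assuming the input is a Python list and that the window size and spacing are given. Both parameter names are choices made for this example.

```python
def overlapping_windows(data, size, step):
    """Return fixed-size pieces of data, each starting step elements after the last."""
    return [data[i:i + size] for i in range(0, len(data) - size + 1, step)]


values = [1, 2, 3, 4, 5, 6]
print(overlapping_windows(values, size=3, step=1))
print(overlapping_windows(values, size=3, step=2))
```

The windows overlap whenever step is smaller than size. In this sketch, any leftover elements that cannot fill a complete window are dropped.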


overwhelming exception (reasoning) <11> 
An induction fallacy that comes from arguing away explana- 
tions that we don’t like. 


P


padding (CNNs) <21> 
See zero-padding. 


paging (optimizers) <19> 
A method for memory management that attempts to keep use- 
ful information in high-speed memory. 


parameter <1> 
A value that controls the operation of an algorithm. In 
machine learning, parameters are often modified and 
improved through training. 


parametric classifier (classifiers) <13>
A classifier that starts with a predetermined model for rep- 
resenting data. That model is entirely described by its form 
(usually mathematical) and a set of numbers, called parame- 
ters. When exposed to data, the classifier seeks to find the best 
values of the parameters to describe the data. For example, in 
linear regression we start with the idea that the data can be 
represented by a straight line. Every 2D line can be described 
with three values. During training, the algorithm seeks the 
values for these three parameters that best describe the data 
it’s seen. 


parametric blending (autoencoders) <24> 
Blending two objects by interpolating their underlying 
parameters. 


parametric ReLU (activation functions) <17> 
A leaky ReLU where we can choose the slope of the line affect- 
ing negative values. 


parent (decision trees) <13> 
The node immediately above a given node. The root has no 
parent. 


partial modal collapse (GANs) <26> 
When the generator in a GAN produces one of a few particular 
outputs every time. 


partial observability (reinforcement learning) <25> 
Another name for limited observability. 


PCA (data prep) <12> 
See principal component analysis.


pdf (statistics) <2> 
See probability distribution function. 


penalty (reasoning) <11>
In operant conditioning, another name for negative
punishment.


perceptron <1> 
An early and highly abstracted mathematical model of a bio- 
logical neuron. It accepts a set of input values, multiplies each 
by an associated weight, sums the results, and then applies a 
threshold function to the summed value to produce an output. 
Today the term usually refers to a generalization of this model 
that includes a bias term, and replaces the threshold function 
with any of a wide variety of alternatives, called activation 
functions. Sometimes modern artificial neurons are still called 
“perceptrons,” though they are more sophisticated than the 
perceptron as originally published. 


perfect negative correlation (statistics) <2> 
A description of two variables with a correlation of -1. 


perfect positive correlation (statistics) <2> 
A description of two variables with a correlation of +1. 


perturbation (CNNs) <21> 
See adversarial perturbation. 


piecewise linear (activation functions) <17> 
A non-smooth function that is made up of multiple straight- 
line segments. It is typically continuous. 


pipeline (scikit-learn) <15> 
An object in the scikit-learn library that packages up multiple
steps. The pipeline may be parameterized, easing the process
of searching for the best hyperparameters for a system. 


plateau (curves and surfaces) <5> 
A region where a curve or surface is flat. 


plurality voting (ensembles) <14>
An election scheme where the candidate with the most votes is 
declared the winner. 


policy (reinforcement learning) <25> 
An algorithm used by an agent to select an action from a list of 
possibilities. 


polynomial (scikit-learn) <15> 
A curve represented by multiple powers and combinations of
its parameters.


pooling (deep learning) <20> 
A particular form of downsampling where small regions, typi- 
cally 2 by 2 blocks, are replaced by a single value. Often this is 
the average or maximum value in the block. 
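
A sketch of pooling with 2 by 2 blocks, assuming the input is a list of equal-length rows with an even number of rows and columns. The function names are made up for this example.

```python
def pool_2x2(grid, combine=max):
    """Replace each non-overlapping 2 by 2 block of grid with a single value."""
    rows, cols = len(grid), len(grid[0])
    return [
        [combine([grid[r][c], grid[r][c + 1],
                  grid[r + 1][c], grid[r + 1][c + 1]])
         for c in range(0, cols - 1, 2)]
        for r in range(0, rows - 1, 2)
    ]


def average(block):
    """Return the mean of the values in a block."""
    return sum(block) / len(block)


grid = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]

print(pool_2x2(grid))           # maximum value in each block
print(pool_2x2(grid, average))  # average value in each block
```

The 4 by 4 input becomes a 2 by 2 output in both cases.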


pooling layer (deep learning) <20> 
A helper layer that implements downsampling. 


population (statistics) <2> 
A collection of objects. We typically study a population to find 
its statistical properties. 


positive covariance (statistics) <2> 
A property of two variables such that as the value of one vari- 
able increases, the value of the other increases as well. 


positive punishment (operant conditioning) <11> 
Another name for punishment.


positive reinforcement (operant conditioning) <11> 
Adding a pleasant stimulus to a situation to increase the fre- 
quency of a desired behavior. 


posterior (Bayes’ Theorem) <4> 
The value that results from an application of Bayes’ Theorem. 


posterior probability (Bayes’ Theorem) <4> 
See posterior. 


PPV (probability) <3>
Positive predictive value. An alternate name for precision.


precision (probability) <3> 
The percent of samples that were properly labeled as true rela- 
tive to all the samples labeled as true. 


predicate (reasoning) <11> 
A description of objects of interest in a categorical syllogism. 


predicted value (classification) <7> 
The value produced by a learner in response to an input. 


prediction (deep learning) <1> 
The output of a learner in response to a given piece of input. 
For regression tasks, the prediction is usually a real number. 
For classification tasks, the prediction may be the most proba- 
ble class, or a list of probabilities, one for each possible class. 


prediction (reasoning) <11> 
A logical relationship between a sample set and the population 
it’s drawn from. If the sample set is representative of the pop- 
ulation, then any given property of the sample set will also be 
shared by the population. 


principal component analysis (autoencoders) <24> 
A mathematical technique for analyzing the samples of a 
data set to identify features that can be usefully combined or 
removed. 


prior (Bayes’ Theorem) <4> 
The probability of event A occurring, or P(A), used to find 
P(A|B). Informally, the starting prior explicitly encodes our 
beliefs and expectations about a system, allowing us to incor- 
porate our knowledge and experience into the calculation. 


prior probability (Bayes’ Theorem) <4> 
See prior. 


private information (reinforcement learning) <25> 
Data held internally by the agent. 


probability (statistics) <2> 
An estimate of how likely it is that some statement is true. 


probability distribution (statistics) <2> 
The probability that a random variable will take on any of a 
collection of possible values. 


probability distribution function (statistics) <2> 
A function that describes the probability that a random vari- 
able will take on each of a collection of possible values. It is 
often abbreviated by the lower-case acronym pdf. 


probability mass function (statistics) <2> 
A probability distribution function that contains only a finite 
number of possible values. Sometimes written using the lower-case
acronym pmf.


projection (data prep) <12> 
The act of reducing the dimensionality of a piece of data by 
replacing it with its nearest point on a given curve or surface. 


pruning (decision trees) <13> 
A technique to control overfitting by removing some leaves 
from a tree. 


pseudo-random numbers (statistics) <2> 
Numbers generated by an algorithm that are intended to be 
unpredictable in practice. Although examination of the algo- 
rithm will reveal exactly which number will be produced next, 
these values are intended to be sufficiently unpredictable, or 
random, for most practical purposes. 


punishment (reasoning) <11> 
In operant conditioning, adding an unpleasant stimulus to a 
situation to decrease the frequency of an undesired behavior. 


purity (classifiers) <13>
When discussing decision trees, a number that characterizes 
the similarity of samples in a node. 


PyCharm Community Edition IDE (Keras) <23> 
A free Python development environment. 


PyTorch (autoencoders) <24> 
An open-source library for deep learning. 


Q 


Q-learning (reinforcement learning) <25> 
An algorithm for reinforcement learning. The Q stands for 
“quality.” 


Q-table (reinforcement learning) <25> 
A table used by Q-learning to save a score for every action in 
every situation. 


Q-value (reinforcement learning) <25> 
A value saved in a Q-table. 


R 


random forest (ensembles) <14> 
A type of decision-tree ensemble where the splitting step on 
each tree is carried out by random feature selection and bag- 
ging, rather than by examining splits from all the features. 


random labeling (ensembles) <14> 
In a classifier, the process of randomly assigning labels to 
input values. 


random numbers (statistics) <2>
A sequence of numbers that cannot be predicted in advance. 
In practice, this requires measuring something in the real 
world, so we often use computer programs to produce pseu- 
do-random numbers, which are intended to be close enough 
to random for most practical uses. 


random search (scikit-learn) <15> 
When evaluating multiple hyperparameters, the approach of 
using randomly-chosen combinations of potential values at 
each step. 


random variable (statistics) <2> 
A function that takes as input a probability distribution, and 
returns a value drawn from that distribution. 


real number (statistics) <2> 
A number that may include a fractional part. The fraction may 
have an unlimited number of digits (such as 1/3 = 0.33333...). 


real-world (scikit-learn) <15> 
Data from physical measurement or observation. 


recall (probability) <3> 
The percentage of true statements that were correctly labeled. 
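
A small sketch of how recall, and the related measure precision, might be computed from two lists of true/false labels. The labels are made up, and the sketch assumes at least one sample is actually true and at least one is labeled true, so neither division is by zero.

```python
def precision_and_recall(actual, predicted):
    """Return (precision, recall) for paired lists of True/False labels."""
    pairs = list(zip(actual, predicted))
    true_positives = sum(1 for a, p in pairs if a and p)
    false_positives = sum(1 for a, p in pairs if not a and p)
    false_negatives = sum(1 for a, p in pairs if a and not p)
    # Precision: of everything labeled true, how much really was true.
    precision = true_positives / (true_positives + false_positives)
    # Recall: of everything really true, how much was labeled true.
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall


actual = [True, True, True, False, False, False, False, True]
predicted = [True, True, False, True, False, False, False, False]
precision, recall = precision_and_recall(actual, predicted)
print(precision, recall)
```

Here 2 of the 3 samples labeled true really are true, and 2 of the 4 true samples were found.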


rectified linear unit (activation functions) <17> 
See ReLU. 


recurrent cell (deep learning) <20> 
Another name for a recurrent unit. 


recurrent network (deep learning) <20> 
A neural network that is dominated by recurrent layers. 


recurrent unit (deep learning) <20> 
A computational unit designed to handle sequential data. 
Typically, the unit has built-in memory that is able to adapt to 
incoming data even after training has been completed. 


regression <1>
The act of discovering a relationship between independent 
and dependent variables. In machine learning, a regression 
task usually involves predicting a real number for each piece 
of input. 


regular grid (scikit-learn) <15> 
When searching for hyperparameters, the approach of explor- 
ing every possible combination of potential values. 


regularization (overfitting) <9> 
An algorithm applied during learning to put off the onset of 
overfitting, or reduce the effect of overfitting. 


regularization term (backpropagation) <18> 
A piece of a network’s error function designed to promote reg- 
ularization. The strength of this term is often represented by 
the Greek letter λ (lambda).


reinforce (reasoning) <11> 
In behaviorism, attempting to cause an individual to perform 
more of a given action. 


reinforcement learning <1> 
A branch of machine learning where an agent takes actions 
that affect an environment, and receives back information, 
or rewards, from the environment, describing the quality of 
those actions. 


relative entropy (information theory) <6> 
See Kullback-Leibler divergence. 


release <1> 
An alternate name for deploy. 


relief (reasoning) <11> 
In operant conditioning, another name for negative 
reinforcement. 


ReLU (activation functions) <17>
A piecewise linear activation function that is 0 to the left of 0,
and the identity function to the right of 0.


remembering (RNNs) <22> 
The process of incorporating some or all of the values in a list 
of numbers into the internal memory of an RNN cell.


representation (reasoning) <11> 
The set of things that a system is capable of expressing, or 
“knowing.” 


representation blending (autoencoders) <24> 
See parametric blending. 


representational power <1> 
An alternate name for capacity. 


resampling (statistics) <2> 
Choosing elements from one set to make another set. This is a 
step in bootstrapping. 


reshaping layer (deep learning) <20> 
A helper layer that reshapes its input tensor. Typically, the 
number of elements in the new tensor must match the num- 
ber of elements in the input. 


reverse-mode automatic differentiation (backpropagation) <18> 
Another name for backpropagation. 


reward (operant conditioning) <11> 
Another name for positive reinforcement. 


reward (reinforcement learning) <11> 
The feedback from the environment to the agent, issued after 
an action to carry information about the new state of the 
environment. 


RMSprop (optimizers) <19> 
An optimization technique similar to Adagrad. 


RNN (deep learning) <20>
See recurrent network.


RNN unit (RNNs) <22> 
See recurrent unit.


rolled-up (RNNs) <22> 
An RNN diagram where the sequential steps are not explicitly 
shown. 


root (decision trees) <13> 
The topmost node in a tree. The root has no parent.


rotation validation (training) <8> 
An alternate name for cross-validation. 


rule-based system <1> 
Generally used as an alternate name for an expert system. 


S


saddle (curves and surfaces) <5> 
A region where the value of a surface increases in one or more 
directions, while it simultaneously decreases in one or more 
other directions. 


sample <1> 
A collection of one or more pieces of information, called fea- 
tures. A data set is made up of one or more samples. 


sample set (statistics) <2> 
A set of values chosen from a population. Used in 
bootstrapping. 


sample-and-hold (feedforward networks) <16> 
A mechanism whereby data arriving at some point in a net- 
work is held there without change until a new value arrives, 
replacing it. 


samplewise processing (data prep) <12>
A transformation that modifies one complete sample at a time. 


SARSA (reinforcement learning) <25>
A reinforcement learning algorithm that adds an extra step to 
Q-learning in order to produce better Q-scores. 


saturation (activation functions) <17> 
A phenomenon of neuron activation functions like the sigmoid 
and tanh. These functions are flat, or nearly flat, for very large 
positive or negative values. This means that their derivative 
in those regions is 0, or nearly 0. In turn, this means that
when we apply backpropagation and update, the changes to
the neuron’s weights will also be 0, or nearly 0. Because
the weights stop changing, we often say that the neuron stops 
learning as a result. 


scikit-learn (scikit-learn) <15> 
An open-source machine learning library for Python. 


score (reinforcement learning) <25> 
An estimate for the value of a given action. 


score (training) <8> 
A general term for a value used to quantify the performance of 
an algorithm. 


search space (Keras) <23> 
The conceptual space that contains the parameters (or hyper- 
parameters) that a person or algorithm is searching, typically 
for the values that lead to the best performance of a learner. 


seed (statistics) <2> 
A starting value for a pseudo-random number generator. Used 
to force the generator to produce the same sequence of values 
repeatedly, which can be helpful when debugging. 
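
A short illustration using the pseudo-random number generator in Python's standard library. The seed value 42 is arbitrary.

```python
import random

# Seeding the generator with the same value produces the same sequence.
random.seed(42)
first_run = [random.random() for _ in range(3)]

random.seed(42)
second_run = [random.random() for _ in range(3)]

print(first_run == second_run)  # True
```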


selecting (RNNs) <22>
The process of running an RNN unit’s internal memory
through a gate before presenting it as output.


selection with replacement (statistics) <2> 
See SWR. 


selection without replacement (statistics) <2> 
See SWOR. 


semi-supervised learning <1> 
A type of machine learning that shares characteristics of both 
supervised learning (where each sample has a label) and 
unsupervised learning (where no samples have labels). In 
semi-supervised learning, there may be a mix of labeled and 
unlabeled data. Alternatively, each sample may be interpreted 
as both data and its own label, as with autoencoders. 


sensitivity (probability) <3> 
An alternate name for recall. 


sentiment analysis (RNNs) <22> 
The process of analyzing a piece of writing to determine 
whether it is expressing a generally positive or negative opin- 
ion on a topic. 


Sequential API (Keras) <23> 
The collection of objects, methods, and functions in Keras 
designed for models whose architecture is a single list of 
layers. 


sequential data (Keras) <24> 
Data that has an inherent order. Often this refers to samples 
representing measurements taken over time. 


sequential operation (feedforward networks) <16> 
Performing a series of operations, one after the other. 


SGD (optimizers) <19>
See stochastic gradient descent. 


Shannon entropy (information theory) <6> 
See entropy. 


shared weights (CNNs) <21> 
The idea that a convolution layer uses the same values for a
given filter every time that filter is used. We can imagine
that there are multiple copies of the filter, all sharing the same 
weights. 


shifted ReLU (activation functions) <17> 
A piecewise linear activation function that has the form of a 
ReLU function, but is shifted down and left. 


shuffling <1> 
Usually refers to changing the order of the samples in a data 
set prior to each epoch of training. 


sibling (decision trees) <13> 
A node at the same depth as a given node. 


sigma (σ) (activation functions) <17>
A symbol often used to represent the sigmoid activation func- 
tion. See sigmoid. 


sigmoid (activation functions) <17> 
An S-shaped curve. When used as an activation function, it
is 0 for large negative values, 1 for large positive values, and
smoothly transitions from 0 to 1 in a neighborhood centered
at 0. This curve, and the function that defines it, are often
referred to by the Greek letter σ (sigma).
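
One widely used S-shaped curve of this kind is the logistic function, sketched below. Choosing the logistic function is an assumption of this example, and the two-branch form is only there to avoid overflow for very large negative inputs.

```python
import math


def sigmoid(x):
    """Logistic sigmoid: near 0 for large negative x, near 1 for large positive x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, use an equivalent form that never exponentiates a large number.
    z = math.exp(x)
    return z / (1.0 + z)


print(sigmoid(-10.0))  # very close to 0
print(sigmoid(0.0))    # exactly 0.5, the middle of the transition
print(sigmoid(10.0))   # very close to 1
```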


sign (curves and surfaces) <5> 
A characteristic of a number. If the number is positive, its sign 
is 1. If the number is negative, its sign is -1. The sign for 0 is 
typically defined as either 0 or 1.


sign function (activation functions) <17> 
Given a number as input, return its sign. 


simple correlation (statistics) <2> 
A correlation between two variables. 


simple probability (probability) <3> 
An alternate name for marginal probability. 


single-valued (curves and surfaces) <5> 
The condition that for any given input value, there is only one 
possible output value of the curve or surface. Informally, the 
curve or surface cannot fold over itself to have more than one 
value above any given location. 


slothful induction (reasoning) <11>
An induction fallacy where we ignore the most obvious or simple
conclusion in favor of an alternative.


smooth (curves and surfaces) <5> 
A curve or surface that does not have any cusps. 


softmax (activation functions) <17> 
An algorithm that implements a mathematical transformation 
on a list of numbers. When used in deep learning classifiers, 
it is typically implemented inside a helper layer of its own, 
placed at the end of the network after a fully-connected layer. 
The input is a list of numbers, with one entry for each class to 
which the input can be assigned. The softmax function trans- 
forms these numbers into probabilities. 
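
As a small sketch of the transformation itself (not of any particular library's layer), the following Python function exponentiates each score and divides by the total. The name `softmax` and the three example scores are assumptions made for this illustration.

```python
import math

def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    # Subtracting the largest score does not change the result
    # mathematically, but it keeps math.exp from overflowing.
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# One score per class; the largest score gets the largest probability.
probs = softmax([2.0, 1.0, 0.1])
print(probs)
```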


softplus (activation functions) <17> 


A smooth and continuous activation function that is like a 
smoothed-out ReLU. 


sound (reasoning) <11> 


A syllogism that is accurate, and whose premises are true 
statements in a given context.


special pleading (reasoning) <11>
A fallacy of logical induction that involves interpreting the 
results selectively, often with an appeal to authority. 


specifying (Keras) <23> 
The process of using a library to describe the architecture of a 
model. 


squashing (activation functions) <17> 
A common reference to the actions of activation functions like 
sigmoid and tanh, which compress their unbounded input values
to the range [0, 1] or [-1, 1], respectively.


stair-step function (activation functions) <17> 
A discontinuous function that takes on different constant val- 
ues in different intervals. The intervals are usually fixed in 
width and each value is greater than the previous value by a 
fixed amount. 


standard deviation (statistics) <2> 
A value that describes the diversity in a set of values. A set of 
values that are close to their mean will have a smaller stan- 
dard deviation than a more spread-out set of values. 


standard form categorical syllogism (reasoning) <11> 
See categorical syllogism. 


standardization (data prep) <12> 
The process of transforming a set of numbers so that they 
have a mean of 0 and a standard deviation of 1. 
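
A minimal Python sketch of this transformation subtracts the mean from every value and divides by the standard deviation. It assumes the population standard deviation (dividing by the number of values); the function name `standardize` and the sample data are made up for the example.

```python
import statistics

def standardize(values):
    """Shift and scale values so they have mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        raise ValueError("all values are identical, so they cannot be scaled")
    return [(v - mean) / std for v in values]

scaled = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(scaled)
```

If the sample standard deviation (dividing by one less than the number of values) is wanted instead, `statistics.stdev` can be swapped in.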


state (feedforward networks) <16> 
Information that describes the current configuration of a sys- 
tem, particularly the parts that may change over time. 


state (RNNs) <22> 
The internal memory of an RNN unit.


statistical syllogism (reasoning) <11>
A type of syllogism that lets us reason from knowledge about 
the population to knowledge about an individual chosen from 
the population. 


step function (activation functions) <17> 
A discontinuous function that has one value to the left of a 
specific input value (the threshold), and another value to its 
right. 


stochastic <25> 
A synonym for random. 


stochastic gradient descent <25> 
An algorithm that performs gradient descent using samples 
that arrive in random order. In basic SGD, backprop and
update steps are applied after each sample is evaluated.
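
The per-sample update can be sketched in a few lines of Python. This is a toy under stated assumptions, not a full training loop: the "network" is a single weight `w` in the model y = w * x, the error is the squared difference, and the gradient is worked out by hand rather than by backprop. The names `sgd_fit_slope`, `learning_rate`, and `epochs` are hypothetical.

```python
import random

def sgd_fit_slope(samples, learning_rate=0.01, epochs=200, seed=0):
    """Fit y = w * x by updating w after every single sample."""
    rng = random.Random(seed)
    samples = list(samples)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(samples)                # samples arrive in random order
        for x, y in samples:
            error = w * x - y               # prediction error for this sample
            gradient = 2.0 * error * x      # derivative of error**2 with respect to w
            w -= learning_rate * gradient   # update right after this sample
    return w

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w = sgd_fit_slope(data)
print(w)
```

Because every sample here lies exactly on the line y = 2x, the weight settles very close to 2.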


stride (convolutional layers, CNNs) <21> 
The distance by which the filters in a convolutional layer are 
moved along the input tensor. When we think of the input 
tensor as an object with multiple channels, there is one value 
of the stride for each dimension of the object, ignoring the 
channels. For example, if the image is a 2D grayscale or color 
image, there are 2 values for the stride, describing the hori- 
zontal and vertical motion of the filter. 
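
To illustrate the two stride values for a 2D input, this Python sketch lists the top-left corner of every placement of a filter, assuming no padding and a single channel. The function and parameter names (`filter_positions`, `stride_v`, `stride_h`) are hypothetical.

```python
def filter_positions(height, width, filter_h, filter_w, stride_v, stride_h):
    """Top-left corners visited by a filter on a 2D input with no padding."""
    rows = range(0, height - filter_h + 1, stride_v)  # vertical motion
    cols = range(0, width - filter_w + 1, stride_h)   # horizontal motion
    return [(r, c) for r in rows for c in cols]

# A 3 by 3 filter on a 5 by 5 input, moving 2 rows down and 1 column across.
positions = filter_positions(5, 5, 3, 3, stride_v=2, stride_h=1)
print(positions)
```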


style transfer (creative applications) <28> 
An algorithm for transforming an image so that it appears to 
have been created in the style of another image. 


strong learner (ensembles) <14> 
A learner that performs well when compared to other (weaker) 
learners. 


structural problem (data sets) <23> 
A problem with the organization of information inside a data 
set.


sub-tree (decision trees) <13>
A piece of a larger tree. 


subject (reasoning) <11> 
The topic of interest in a categorical syllogism. 


subjective Bayes (Bayes’ Theorem) <4> 
A use of Bayes’ Theorem where we choose our prior using sub- 
jective, or personal, criteria. 


supervised learning <1> 
A type of machine learning where each sample has a label. 


support vector machine (classifiers) <13> 
A supervised learning algorithm that attempts to separate 
samples of different classes using a boundary that is as far as 
possible from all samples. 


support vectors (classifiers) <13> 
The samples that are used by a support vector machine to 
build a boundary. 


SVM (classifiers) <13> 
See support vector machine. 


swish (activation functions) <17> 
A small modification to the ReLU activation function where 
the sharp bend is replaced by a smooth and continuous curve. 
Unlike the ReLU, the derivative is defined for every input 
value. 


SWOR (statistics) <2> 
Sampling without replacement. Given a collection of elements, 
select one element (often at random) and remove it from the 
collection, so it cannot be chosen again.


SWR (statistics) <2>
Sampling with replacement. Given a collection of elements, 
select one element (often at random) and leave it in the collec- 
tion, so it may be chosen again. 
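
The difference between SWOR and SWR is easy to see with Python's standard `random` module, where `sample` draws without replacement and `choices` draws with replacement. The item list and the seed are arbitrary choices for this sketch.

```python
import random

rng = random.Random(42)
items = ["a", "b", "c", "d", "e"]

# SWOR: each element can be chosen at most once.
without_replacement = rng.sample(items, k=3)

# SWR: every draw sees the full collection, so repeats are possible.
# With 8 draws from only 5 elements, at least one repeat is certain.
with_replacement = rng.choices(items, k=8)

print(without_replacement)
print(with_replacement)
```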


syllogism (reasoning) <11> 
A template for a structured argument. The most common type 
is the categorical syllogism. 


syllogistic fallacy (reasoning) <11> 
An alternate name for an invalid syllogism. 


symbols (information theory) <6> 
The elements of a message. 


symmetry (CNNs) <21> 
A characteristic of 2 or more filters that share the same 
weights. 


symmetry (curves and surfaces) <5> 
A property of an object that does not change when a particu- 
lar transformation is applied to that object. For example, 2D 
vertical mirror symmetry means that a shape is not changed 
if it is flipped over a mirror line to exchange the left and right 
sides. 


symmetry breaking (CNNs) <21> 
The process of making sure that each filter in a convolutional 
layer has a unique set of weights. 


synchronous (feedforward networks) <16> 
Two or more actions happening simultaneously. In a network, 
this usually means that all nodes compute a new output at the 
same time. 


synthetic data (scikit-learn) <15> 
Samples created algorithmically.


T


tangent (curves and surfaces) <5> 
With respect to a curve or surface, the line or surface that just 
touches at a given point. 


tanh (activation functions) <17> 
The hyperbolic tangent function, often used as an activation 
function. It is an S-shaped curve, with value -1 for large negative
inputs, 1 for large positive inputs, and a smooth transition
from -1 to 1 in a neighborhood centered at 0.


target (backpropagation) <18> 
The manually-assigned value (usually a category or real num- 
ber) attached to each sample. We want the network to predict 
the target. 


tensor (deep learning) <20> 
A block of data with any number of dimensions. 


tensor processing unit (Keras) <23> 
A chip designed to accelerate the calculations involved in 
training and using deep learning models. 


TensorFlow (deep learning) <20> 
An open-source library for deep learning. 


terminal node (decision trees) <13> 
A node with no children. 


test data (training) <8> 
See test set. 


test error (backpropagation) <18> 
A measurement of the error made by a learner in predict- 
ing results for samples in the test set. This is often the error 
quoted for the system when it is deployed.


test set <1>
Data held aside during training. It is used just once, when 
training is completed, to evaluate the quality of the trained 
model. 


TFR (reinforcement learning) <25> 
See total future reward. 


Theano (deep learning) <20> 
An open-source library for deep learning. 


threshold (neurons) <10> 
In piecewise-linear functions, the location of a discontinuity, 
or perhaps only a corner, where one linear section ends and 
the next begins. 


time stamp (feedforward networks) <16> 
A means for attaching a time to a piece of data. Usually this is 
the moment when the data is created or computed. 


time step (RNNs) <22> 
The potentially multiple values associated with each feature in 
a sample. The name comes about because the values are often, 
though not always, the result of measurements of something 
over time. 


top-down (reasoning) <11> 
Another name for deductive reasoning. 


Torch (deep learning) <20> 
An open-source library for deep learning. 


total future reward (reinforcement learning) <25> 
The sum of all rewards received by an agent from a given 
action to the end of the episode. 
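
As a tiny sketch, if the rewards of an episode are stored in a list, the total future reward from a given step is the sum of the rest of the list. This assumes the reward for the given action is itself included and that no discounting is applied; the function name and the reward values are invented for the example.

```python
def total_future_reward(rewards, step):
    """Sum of the rewards from `step` through the end of the episode."""
    return sum(rewards[step:])

episode_rewards = [0.0, 1.0, 0.0, -2.0, 5.0]
print(total_future_reward(episode_rewards, 3))
```

Calling the function with step 0 gives the total reward for the whole episode.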


total reward (reinforcement learning) <25> 
The sum of all of the rewards received by an agent during an 
episode.


TPU (Keras) <23>
See tensor processing unit. 


train <1> 
To give an algorithm a set of data for it to learn from. 


training (training) <8> 
The process of exposing a learner to a set of samples called the 
training set. The learner uses those samples to adjust its inter- 
nal parameters so that it can represent the data in a way that 
allows it to do the task we’ve designed it for. 


training data (classification) <7> 
See training set. 


training error (overfitting) <9> 
A measurement of the error made by a learner in predicting 
results for samples in the training set. 


training loss (overfitting) <9> 
Another name for training error. 


training set (training) <8> 
The collection of data that is provided to a learner. The sam- 
ples may or may not be labeled. 


transfer function (activation functions) <17> 
Another name for an activation function. 


transfer learning (deep learning) <20> 
The process of adapting an existing deep learner for a new 
task. This can involve adding or removing layers, modifying 
layers, applying additional training, or any combination of 
these steps. 


transition (reinforcement learning) <25> 
The transformation of an object from one state to another. 


transposed convolution (CNNs) <21> 
Performing upsampling within a convolution layer.


triggered (reinforcement learning) <25>
An agent or environment that stops working or changing when 
waiting for a signal from the other party. 


true negative (probability) <3> 
In a classification problem, a sample that was correctly classi- 
fied as false. 


true positive (probability) <3> 
In a classification problem, a sample that was correctly classi- 
fied as true. 


true positive rate (probability) <3> 
An alternate name for recall. 


U 


ultimate reward (reinforcement learning) <25> 
The last reward in an episode sent from the environment to 
the agent, often signaling success or failure of the agent’s prin- 
cipal goal. 


unbalanced (decision trees) <13> 
A decision tree with an asymmetrical shape. 


underfitting (overfitting) <9> 
The region during training when performance on both the 
training data and the validation data is increasing, and loss on 
both data sets is decreasing. 


undistributed middle (reasoning) <11> 
A syllogistic fallacy. In schematic form, it incorrectly asserts 
that because 1) All A are B, and 2) All C are B, therefore 3) All
C are A.


unfreeze (neural networks) <26> 
Allow the weights in a layer to be updated during training.


uniform distribution (statistics) <2>
A distribution that has value 0 everywhere except a fixed 
region, where every input has the same output value. 


uniform initialization (neural networks) <16> 
Setting every weight in a neural network to the same value. 


uniform random distribution (feedforward networks) <16> 
A distribution that is zero everywhere except for a finite inter- 
val, where it has a constant value. 


unit (neurons) <10> 
An alternate name for an artificial neuron. 


unit step (activation functions) <17> 
A step function that is 0 to the left of the threshold, and 1 to
the right. 


univariate transformation (data prep) <12> 
A transformation that collects statistics about just one feature 
of a data set, and uses those statistics to transform just that 
one feature. 


universal adversary (CNNs) <21> 
An image that is designed to cause many, or all, convnets to 
produce an incorrect answer. 


unrolled RNN (RNNs) <22> 
A drawing of an RNN showing each of its steps explicitly. 


unsound (reasoning) <11> 
A syllogism that is accurate, but has at least one premise that 
is not a true statement in a given context. 


unsupervised learning <1> 
A type of machine learning where input samples do not have
labels.


update rule (reinforcement learning) <25>


An algorithm for changing the value of an action’s score that is 
saved in a table of such scores. 


updating (deep learning) <20> 
The step after backpropagation that actually modifies weights, 
and thus implements learning. When backpropagation is used 
to find the gradient at each weight in a deep neural network, 
each weight is then updated, or evaluated and potentially 
assigned a new value intended to reduce the network’s error. 
This process is influenced by the learning rate. 


upsampling layer (deep learning) <20> 
A helper layer that usually considers the input to be an image 
of 1 or more channels, and scales up the width and height of 
that image, creating a new tensor that has more elements than 
the input. Typically, the elements are simply repeated as nec- 
essary, so the new image has width and height that are integer 
multiples of the input width and height. 
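
The "simply repeated" behavior can be sketched in plain Python for a single-channel image stored as a list of rows. This illustrates the idea only and is not the code of any library's layer; the name `upsample` and the parameter `factor` are hypothetical.

```python
def upsample(image, factor):
    """Repeat each element `factor` times horizontally and vertically."""
    result = []
    for row in image:
        wide_row = [value for value in row for _ in range(factor)]
        # Append independent copies so the output rows do not share storage.
        result.extend(list(wide_row) for _ in range(factor))
    return result

small = [[1, 2],
         [3, 4]]
big = upsample(small, 2)
for row in big:
    print(row)
```

With a factor of 2, the 2 by 2 input becomes a 4 by 4 output in which each original element fills a 2 by 2 block.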


V 


VAE (autoencoders) <24> 
See variational autoencoder. 


valid (reasoning) <11> 


A categorical syllogism that correctly applies logic to derive its 
conclusion. 


validation error (overfitting) <9> 


The error made by a learner when evaluating the validation 
set. 


validation set (training) <8> 


A set of samples that a learner has not seen before, so we can 
use them to estimate its performance on new data. This set is 
usually created when performing cross-validation.


vanish (curves and surfaces) <5>
With respect to a curve or surface, when the derivative or
gradient goes to 0 we say it has vanished.


vanishing gradient (RNNs) <22> 
A problem that can occur during the training of an RNN, when 
the gradient at a particular weight decreases to 0. This
prevents the weight from changing.


variable-bitrate code (information theory) <6> 
A representation of a set of symbols such that each symbol is 
potentially represented by a different number of bits. Usually 
a probability distribution is consulted to assign smaller num- 
bers of bits to more frequent symbols. 


variance (statistics) <2> 


How widely spread apart a set of numbers is from its
mean.


variance (curves and surfaces) <5> 
A property of the algorithm where the curves or surfaces are 
strongly influenced by the underlying data, and are therefore 
considerably different from one another. 


variance normalization (data prep) <12> 


Adjusting a set of numbers so that it has a standard deviation 
of 1. 


variational autoencoder (Keras) <23> 
A type of autoencoder that represents its latent variables as 
distributions. A variational autoencoder is able to generate 
new samples that are similar to, but different from, the input 
samples. 


vertex (feedforward networks) <16> 
Another name for a node in a network.


VGG16 (CNNs) <21>
A convolutional neural network trained on 1.2 million images 
of household items and animals. The structure of the network, 
its weights, and its pre-processing steps are publicly available. 
VGG stands for “Visual Geometry Group.” The 16 refers to the
16 layers with trainable weights in the network: 13 convolution
layers and 3 fully-connected layers.


VGG19 (CNNs) <21> 


A variant on VGG16 with one more convolution layer in each 
of the last 3 blocks. 


volume (deep learning) <20> 


A data structure formed as a 3D box. It can also be thought of 
as a 3D tensor. 


W 


weak learner (ensembles) <14> 
A learner that may be only slightly better than random. 


weight (neurons) <10> 
A value associated with an artificial neuron. Each input to the 
neuron has its own associated real number, or weight. Each 
input is multiplied by its corresponding weight, and then 
those products are added together. During the learning phase,
the weights are adjusted to create the best performance from
the network. 
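
The multiply-and-add step described here can be written as a short Python sketch. The function name `weighted_sum` and the sample numbers are made up for the illustration, and the sketch covers only the sum, not any bias or activation function that may follow it.

```python
def weighted_sum(inputs, weights):
    """Multiply each input by its own weight and add up the products."""
    if len(inputs) != len(weights):
        raise ValueError("each input needs exactly one weight")
    return sum(x * w for x, w in zip(inputs, weights))

# 1.0 * 0.5 + 2.0 * (-1.0) + 3.0 * 2.0 = 4.5
print(weighted_sum([1.0, 2.0, 3.0], [0.5, -1.0, 2.0]))
```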


weighted plurality voting (ensembles) <14> 
An election scheme like plurality voting, except that each vote 
is multiplied by a corresponding weight. The winner is the 
candidate with the largest sum of weighted votes. 
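
A minimal Python sketch of such an election adds up the weights behind each candidate and returns the candidate with the largest total. The function name, the votes, and the weights are all invented for the example. Ties go to whichever tied candidate received a vote first, which is a choice made for this sketch rather than part of the definition.

```python
from collections import defaultdict

def weighted_plurality(votes, weights):
    """Return the candidate with the largest sum of weighted votes."""
    totals = defaultdict(float)
    for candidate, weight in zip(votes, weights):
        totals[candidate] += weight
    # max() keeps the first candidate it meets among equal totals.
    return max(totals, key=totals.get)

votes = ["cat", "dog", "dog", "cat", "bird"]
weights = [0.9, 0.3, 0.4, 0.2, 0.5]
winner = weighted_plurality(votes, weights)
print(winner)
```

Here "cat" collects 1.1, "dog" 0.7, and "bird" 0.5, so "cat" wins.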


windowed sample (RNNs) <22> 
A piece of the input data, usually of a fixed size.


word2vec (RNNs) <22>


An algorithm that converts words into sequences of numbers. 
Interpreting each sequence of numbers as a point in a
multidimensional space, similar words are generally placed near one
another in that space. 


wrapper (Keras) <23> 


An object that contains a Keras object, allowing its capabilities
to be extended. For example, some wrappers let a Keras
model be used by scikit-learn. 


X 


Xavier normal initialization (CNNs) <21> 


Another name for Glorot normal initialization. 


Xavier uniform initialization (CNNs) <21> 


Another name for Glorot uniform initialization. 


Z 


zero padding layer (CNNs) <21> 


A helper layer that usually considers the input to be an image 
of 1 or more channels, and adds elements with value 0 to the
left, right, top, and/or bottom edges of the image. The result is 
a tensor with more elements than in the input. 
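
The effect of such a layer on a single-channel image can be sketched in plain Python, with the image stored as a list of rows. This is an illustration of the idea only; the function name `zero_pad` and its keyword parameters are hypothetical.

```python
def zero_pad(image, top=0, bottom=0, left=0, right=0):
    """Surround a 2D image with the requested number of zeros on each edge."""
    width = len(image[0]) + left + right
    padded = [[0] * width for _ in range(top)]
    for row in image:
        padded.append([0] * left + list(row) + [0] * right)
    padded.extend([0] * width for _ in range(bottom))
    return padded

small = [[1, 2],
         [3, 4]]
for row in zero_pad(small, top=1, bottom=1, left=1, right=1):
    print(row)
```

Padding a 2 by 2 image by one element on every edge produces a 4 by 4 result with the original values in the center.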


zero sum (GANs) <26> 


A type of game where players compete for a fixed set of 
resources. This means that once all resources have been 
claimed, further accumulation of resources by one player can 
only come by removing those resources from another.